Author: Robin Cockett
Date: January 8, 2014

(Chapter 1: Louden & ASU)

Compiler as a translator:
      source language  --------| COMPILER |--->  target language

A compiler is a program which reads in a "program" in a source language and translates it into a program (often executable) in a target language.   We might expect that the program produced will be a "good translation" in some measurable sense ...

An interpreter (officially) is not a compiler.  An interpreter reads in a source program and produces the result of running or executing that program.  This involves actually executing the program in some fashion!  The technology required for compiling is also needed to produce an interpreter, as an interpreter usually has an internal abstract machine code which it can execute and into which source programs must be compiled.  Many modern programming language systems blur the traditional distinction between interpreter and compiler, as they support both!
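The distinction can be made concrete with a toy example.  Here is a sketch (in Python, over a hypothetical tuple-based AST) of the interpreter side: rather than translating the program, it directly produces the result of executing it.

```python
def interpret(expr, env):
    """Execute a (toy, tuple-based) expression AST directly and return its
    value; env maps identifiers to values.  No target program is produced --
    this direct execution is what makes it an interpreter."""
    kind = expr[0]
    if kind == "num":                  # literal
        return expr[1]
    if kind == "id":                   # variable lookup
        return env[expr[1]]
    left = interpret(expr[1], env)     # evaluate, rather than translate,
    right = interpret(expr[2], env)    # the subexpressions
    if kind == "add":
        return left + right
    if kind == "mul":
        return left * right
    raise ValueError("unknown construct: " + kind)
```

A compiler, by contrast, would walk the same tree and emit code to be run later.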


What should a compiler do?
  1. Preserve the meaning of the original source code
  2. Produce reasonably (space and time) efficient target code
  3. Perform its job quickly! (order of the size of the program or O(n log n)...)
Compiling is not everything!

A compiler is usually just one step in a number of steps which lead to the production of "absolute machine code" ... however it is perhaps the biggest step!
              ______________        _________        __________        ______
             |              |      |         |      |          |      |      |
        ->---|  preprocess  |-->---| compile |-->---| assemble |-->---| link |--->
             |______________|      |_________|      |__________|      |______|
      source            source              assembler          relocatable     absolute
                                                               object code     machine code


Overview of a compiler:

  String
       |
   ____|_______                                   ____
  | Lexical    |                                 |
  | Analysis   |                                 |
  | (Scanning) |                                 |  Syntax:
  |____________|                                 |
       |                                         |  O(|Prog|)
  Token list                                     |
   ____|_______                                  |
  | Grammatical|                                 |
  | Analysis   |                                 |
  | (Parsing)  |                                 |____
  |____________|
       |
  Syntax Tree (AST)
       |
   ________      ____|_______      _________      ____
  | Symbol  |   | Semantic   |    | Error   |    |  Semantics:
  | Table   |---| Analysis   |----| Handler |    |  O(|C| log |C|)
  |________|    |____________|    |_________|    |____
       |
  Intermediate representation (IR)
   ____|_______                                   ____
  | Optim-     |                                 |  Most of the problems
  | ization    |                                 |  here are NP-complete:
  |____________|                                 |  approximate algorithms
       |                                         |  are appropriate!
      (IR)                                       |
   ____|_______                                  |
  | Code       |                                 |
  | Generat'n  |                                 |
  |____________|                                 |____
       |
  Assembler

Let us follow an expression through a compiler:

           position := initial + rate * 60  

The first question one must ask is whether this is a valid expression
in the given source language.  For this we need
(a) A formal definition of the language
(b) An effective membership test
(c) A plan for how to handle failure (i.e. error recovery!)

Usually the syntax of a language is specified in two stages:

  1. a specification of the lexemes of the language (that is the substrings  of the input which will have special significance to the compiler) and  their correspondence to the tokens  ...
  2. a specification of the grammar of the language based on tokens (obtained by translation from the lexemes) which are the terminal  symbols of the language.  This is usually done by providing a context free grammar.

STEP 1:   Lexical analysis      (micro syntax  -- regular expressions)

Group characters together to form lexemes, which are translated into tokens: the terminals of the grammar used in the next step.

Lexeme        Token
"127"         INT(000000001111111)
"length"      ID("length")
"+"           ADD

This can be efficiently implemented using deterministic finite automata (DFAs).  These are often specified using "patterns" which are essentially regular expressions (we will eventually be using LEX).
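As a sketch of how patterns drive scanning, here is a minimal Python tokenizer (the token names follow the table above; a real scanner would be generated from such patterns by LEX rather than written by hand):

```python
import re

# Each entry pairs a token name with a regular expression pattern.
# The scanner repeatedly matches at the current position and emits
# the token whose pattern matched.
TOKEN_PATTERNS = [
    ("NUM",    r"[0-9]+"),
    ("ID",     r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("ASSIGN", r":="),
    ("ADD",    r"\+"),
    ("MUL",    r"\*"),
    ("SKIP",   r"[ \t]+"),   # whitespace: a lexeme, but not a token
]
MASTER = re.compile("|".join("(?P<%s>%s)" % p for p in TOKEN_PATTERNS))

def tokenize(source):
    """Translate a string of characters into a list of (token, lexeme) pairs."""
    tokens, pos = [], 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError("illegal character at position %d" % pos)
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

For example, tokenize("position := initial + rate * 60") yields the token list consumed by the parser in the next step.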

STEP 2:   Grammatical analysis    (macro syntax - context free)

The structure of the language is defined by a context free grammar, and it is determined whether the stream of tokens belongs to the language generated by that grammar.

             ---- example grammar for mini language ----
prog -> stmt

stmt ->    IF expr THEN stmt ELSE stmt
             |   ID ASSIGN expr
             |   PRINT expr
             |   BEGIN stmtlist END

stmtlist -> stmt morestmtlist

morestmtlist -> SEMICOLON stmtlist
             |   (empty)

expr -> term moreterms

term ->   factor morefactors
             |  SUB term

factor ->   LPAR expr RPAR
              |   ID
              |   NUM

morefactors -> MUL factor morefactors
              |  DIV factor morefactors
              |  (empty)

moreterms -> ADD term moreterms
             |   SUB term moreterms
             |   (empty)


Efficient parsers are built using pushdown automata (PDA): LL(n) grammars admit predictive parsing, while LR(n) grammars use shift-reduce parsing.  A very simple parsing technique is "recursive descent parsing", which requires an LL(n) grammar.
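As a sketch of recursive descent, here is a recognizer (Python, hypothetical names) for the expression part of the grammar above: each nonterminal becomes a procedure, and the token list produced by the scanner is consumed left to right.  It assumes the "more..." nonterminals may also derive the empty string, as is usual for grammars in this style.

```python
def parse_expr(tokens):
    """Recursive descent recognizer: True iff tokens derive expr exactly."""
    pos = [0]  # mutable cursor shared by the nested procedures

    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None

    def eat(tok):
        if peek() != tok:
            raise SyntaxError("expected %s, saw %s" % (tok, peek()))
        pos[0] += 1

    def expr():            # expr -> term moreterms
        term(); moreterms()

    def moreterms():       # moreterms -> ADD term ... | SUB term ... | (empty)
        while peek() in ("ADD", "SUB"):
            eat(peek()); term()

    def term():            # term -> SUB term | factor morefactors
        if peek() == "SUB":
            eat("SUB"); term()
        else:
            factor(); morefactors()

    def morefactors():     # morefactors -> MUL factor ... | DIV ... | (empty)
        while peek() in ("MUL", "DIV"):
            eat(peek()); factor()

    def factor():          # factor -> LPAR expr RPAR | ID | NUM
        if peek() == "LPAR":
            eat("LPAR"); expr(); eat("RPAR")
        elif peek() in ("ID", "NUM"):
            eat(peek())
        else:
            raise SyntaxError("unexpected token %s" % peek())

    try:
        expr()
        return pos[0] == len(tokens)   # all input must be consumed
    except SyntaxError:
        return False
```

Note how each procedure decides which production to use by looking only at the next token: this one-token lookahead is exactly the LL(1) property.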

STEP 3:   Semantic analysis            (meaning --- type checking!)

Consider again the assignment

        position := initial + rate * 60

Does this syntactically legal expression actually mean anything?  How do we tell?

Suppose that all the variables are declared as real numbers then the meaning of "+" is "real addition" which will be executed rather  differently  than if all the declared variables had been integers and its meaning had been "integer addition".  Furthermore, in the former case we will also have to turn the integer "60" into a real ...

The main component of semantic checking is checking that variables have been declared and type checking expressions.
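As a sketch, the declaration check and the int/real typing rule can be written over a toy tuple-based AST (Python; the function and AST shapes are hypothetical, for illustration only):

```python
def check(expr, symtab):
    """Return the type ('int' or 'real') of expr, where symtab maps declared
    variables to their types.  Undeclared variables are the main semantic
    error; mixed int/real operands force a coercion."""
    kind = expr[0]
    if kind == "num":
        return "int"                      # integer literal such as 60
    if kind == "id":
        if expr[1] not in symtab:
            raise TypeError("undeclared variable: " + expr[1])
        return symtab[expr[1]]
    if kind in ("add", "mul"):
        left = check(expr[1], symtab)
        right = check(expr[2], symtab)
        # if either side is real, the int side must be coerced (int2real)
        # and the operation becomes real addition/multiplication
        return "real" if "real" in (left, right) else "int"
    raise TypeError("unknown construct: " + kind)
```

On the running example, with position, initial, and rate declared real, the whole expression checks as real and the literal 60 is the operand needing an int2real coercion.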

At the end of this stage we know that we have a legal program.

STEP 4:   Intermediate code generation

Once the program is known to be legal, we generate an intermediate representation of the code.  This representation may be sufficiently close to machine code that it facilitates the translation to machine code, and yet sufficiently high level that it facilitates optimization.  A typical intermediate code is "three address code".  Each instruction in this code is a simple assignment consisting of an assigned variable, a binary or unary operation, and its arguments.  The above assignment might translate to:

                 temp1 := int2real(60)
                 temp2 := id3 * temp1
                 temp3 := id2 + temp2
                 id1 := temp3
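A sketch of how such a sequence might be generated from the syntax tree (Python; the helper names and the tuple instruction format are hypothetical, and the int2real coercion step is omitted to keep the sketch short):

```python
from itertools import count

def gen(expr, code, fresh):
    """Emit three address instructions for expr into code and return the
    name (program variable, literal, or temporary) holding its value."""
    kind = expr[0]
    if kind in ("num", "id"):
        return str(expr[1])
    left = gen(expr[1], code, fresh)     # each emitted instruction has one
    right = gen(expr[2], code, fresh)    # operation and at most two arguments
    temp = "temp%d" % next(fresh)
    code.append((temp, {"add": "+", "mul": "*"}[kind], left, right))
    return temp

def compile_assignment(target, expr):
    """Translate 'target := expr' into a list of three address instructions."""
    code, fresh = [], count(1)
    result = gen(expr, code, fresh)
    code.append((target, ":=", result, None))
    return code
```

Fresh temporaries are invented for every intermediate value; it is the optimizer's job (next step) to get rid of the redundant ones.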

Modern compilers often use higher level intermediate languages which have associated transformations and which can be reduced to this form.

STEP 5: Optimization of intermediate code

The intermediate representation is optimized using program transformation techniques.

Code optimization is a major topic in itself!  Some of it is covered in CPSC510, where you are required to build a complete compiler with optimizations.

 These transformations, applied to the above code, may have the following effect:

                temp1 := id3 * 60.0
                id1 := id2 + temp1
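Two classic transformations suffice for this effect: constant folding (the conversion int2real(60) is done at compile time, giving 60.0) and copy propagation (the copies through temporaries are eliminated).  A toy sketch in Python, over a hypothetical tuple instruction format (dest, op, arg1, arg2); it handles only this shape of code, and the surviving temporary names may differ from those above:

```python
def optimize(code):
    """Constant folding of int2real on literals, then copy propagation."""
    env, out = {}, []
    for dest, op, a, b in code:
        a, b = env.get(a, a), env.get(b, b)    # substitute known values
        if op == "int2real" and a.isdigit():
            env[dest] = a + ".0"               # fold conversion at compile time
        elif op == ":=":
            env[dest] = a                      # pure copy: emit nothing yet
        else:
            out.append([dest, op, a, b])
    # a copy x := t (x a program variable, t a temporary) is removed by
    # renaming the instruction that defined t so it assigns x directly
    for name, repl in env.items():
        if not name.startswith("temp"):
            for instr in out:
                if instr[0] == repl:
                    instr[0] = name
    return [tuple(i) for i in out]
```

On the four-instruction sequence above this yields the two-instruction form: a real multiplication by the folded constant 60.0, then an addition assigned directly to id1.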

STEP 6:  Code generation

Intermediate code is translated to a suitable assembler code.  Here is the SPARC assembler code for this!

        ld      [%fp-16],%f4
        ld      [%fp-12],%f3
        sethi   %hi(.L_cseg2),%l0
        or      %l0,%lo(.L_cseg2),%l0
        ld      [%l0+0],%f2
        fmuls   %f3,%f2,%f2
        fadds   %f4,%f2,%f2
        st      %f2,[%fp-8]

        **************   USEFUL INFO! **************

To obtain this I actually wrote a short C program

        #include <stdio.h>

        int main(void)
        {
            float position, rate, initial;

            initial = 12.0;
            rate = 3;
            position = initial + rate * 60;
            printf("%f %f %f \n", position, initial, rate);
            return 0;
        }

and then compiled it with the  -S  option.  This produces an assembler (.s) file for the program.  If you want to see how statements should be translated into assembler, this reverse engineering technique is invaluable .... when you get to the code generation stage (next course: CPSC510).

STEP 7:  Assembler optimizations

In generating assembler there are a number of important issues; these are beyond the scope of this course (see CPSC510).


In summary, the attributes of a good compiler:

  1.  Correctness (does it preserve meaning? Not as easy as it sounds, but very important!)
  2.  Compiles quickly (complexity of compiling a program O(n log n) ... remember bootstrapping!)
  3.  Output execution speed (how fast is the output code?)
  4.  Output footprint (how large is the code? how much memory does it use?)
  5.  Separate compilation (relocatable code, linking).
  6.  User friendly front-end: good error recovery.
  7.  Debugging facilities.
  8.  Cross language calls -- interface compatibilities.
  9.  Understandable and correct optimizations.