LECTURE 1
Author: Robin Cockett
Date: January 8, 2014
WHAT IS A COMPILER
(Chapter 1: Louden & ASU)
Compiler as a translator:

                             ___________
                            |           |
     source language --->---| COMPILER  |--->--- target language
                            |___________|
A compiler is a program which reads in a "program" in a source language and translates it into a program (often executable) in a target language. We might expect that the program produced will be a "good translation" in some measurable sense ...
An interpreter (officially) is not a compiler. An interpreter reads in a source program and produces the result of running or executing that program. This involves actually executing the program in some fashion! The technology required for compiling is also required to produce an interpreter, as usually an interpreter has an internal abstract machine code which it can execute and into which the source programs must be compiled. Many modern programming language systems blur the traditional distinction between interpreter and compiler as they support both!
Examples: Java compiles source to bytecode, which the JVM then interprets (or JIT-compiles); Python compiles source to an internal bytecode which its interpreter executes.
A compiler is usually just one step in a number of steps which lead to the production of "absolute machine code" ... however it is perhaps the biggest step!
         ______________       ___________       ____________       ________
        |              |     |           |     |            |     |        |
   -->--|  preprocess  |-->--|  compile  |-->--|  assemble  |-->--|  link  |-->--
        |______________|     |___________|     |____________|     |________|
  source           source              assembler          relocatable      absolute
                                       code               object code      machine code
THE STEPS OF COMPILING
Overview of a compiler:
                      String
                        |
                 _______|_______
                |               |
                |    Lexical    |
                |    Analysis   |
   Syntax:      |   (Scanning)  |      O(|Prog|)
                |_______________|
                        |
                    Token list
                        |
                 _______|_______
                |               |
                |  Grammatical  |
                |    Analysis   |
                |   (Parsing)   |
                |_______________|
                        |
                 Syntax Tree (AST)
                        |
    ________     _______|_______     _________
   |        |   |               |   |         |
   | Symbol |---|   Semantic    |---|  Error  |   O(|C| log |C|)
   | Table  |   |   Analysis    |   | Handler |
   |________|   |_______________|   |_________|
                        |
         Intermediate representation (IR)
                        |
                 _______|_______
                |               |
                |    Optim-     |      Most of the problems here
                |    ization    |      are NP-complete: approximate
                |_______________|      algorithms are appropriate!
                        |
                      (IR)
                        |
                 _______|_______
                |               |
                |     Code      |
                |   Generat'n   |
                |_______________|
                        |
                    Assembler
Let us follow an expression through a compiler:
    position := initial + rate * 60
The first question one must ask is whether this is a valid expression in the given source language. For this we need
(a) A formal definition of the language
(b) An effective membership test
(c) A plan for how to handle failure (i.e. error recovery!)
Usually the syntax of a language is specified in two stages: first, characters are grouped together to form lexemes, which are translated into tokens; the tokens are the terminals of the grammar used in the next stage.
    Lexeme     | Token
    -----------|----------------------
    "127"      | INT(000000001111111)
    "length"   | ID("length")
    "+"        | ADD
This can be efficiently implemented using deterministic finite automata (DFA). These are often specified using "patterns" which are essentially regular expressions (we will eventually be using LEX).
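As a concrete illustration, here is a minimal hand-coded scanner sketch in C for the three token kinds in the table above. The enum and function names are my own invention, not part of LEX or any particular tool, and there is no error recovery or buffer-overflow protection.

```c
#include <ctype.h>
#include <string.h>

/* Token kinds from the table above; names are illustrative. */
typedef enum { TOK_INT, TOK_ID, TOK_ADD, TOK_EOF, TOK_ERROR } TokenKind;

typedef struct {
    TokenKind kind;
    char lexeme[64];    /* the matched characters */
    int value;          /* meaningful only for TOK_INT */
} Token;

/* A hand-coded DFA: the "state" is the position in this function.
   *p is advanced past the characters consumed for the token. */
Token next_token(const char **p)
{
    Token t = { TOK_EOF, "", 0 };
    const char *s = *p;
    while (isspace((unsigned char)*s)) s++;        /* skip whitespace */
    if (*s == '\0') { *p = s; return t; }
    if (isdigit((unsigned char)*s)) {              /* in an INT */
        int i = 0;
        while (isdigit((unsigned char)*s)) {
            t.value = t.value * 10 + (*s - '0');
            t.lexeme[i++] = *s++;
        }
        t.lexeme[i] = '\0';
        t.kind = TOK_INT;
    } else if (isalpha((unsigned char)*s)) {       /* in an ID */
        int i = 0;
        while (isalnum((unsigned char)*s)) t.lexeme[i++] = *s++;
        t.lexeme[i] = '\0';
        t.kind = TOK_ID;
    } else if (*s == '+') {                        /* single-char token */
        t.kind = TOK_ADD; t.lexeme[0] = *s++; t.lexeme[1] = '\0';
    } else {
        t.kind = TOK_ERROR; t.lexeme[0] = *s++; t.lexeme[1] = '\0';
    }
    *p = s;
    return t;
}
```

Each call consumes the longest prefix matching one pattern ("maximal munch"), which is exactly what a LEX-generated scanner does.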
In the second stage, the structure of the language is defined by a context-free grammar, and it is determined whether the stream of tokens belongs to the language generated by that grammar.
---- example grammar for mini language ----

    prog -> stmt

    stmt -> IF expr THEN stmt ELSE stmt
          | ID ASSIGN expr
          | PRINT expr
          | BEGIN stmtlist END

    stmtlist -> stmt morestmtlist
              |

    morestmtlist -> SEMICOLON stmtlist
                  |

    expr -> term moreterms

    term -> factor morefactors
          | SUB term

    factor -> LPAR expr RPAR
            | ID
            | NUM

    morefactors -> MUL factor morefactors
                 | DIV factor morefactors
                 |

    moreterms -> ADD term moreterms
               | SUB term moreterms
               |
Efficient parsers are built using pushdown automata (PDA) for LL(n) and LR(n) grammars; LR(n) parsers use shift-reduce parsing. A very simple parsing technique is "recursive descent parsing", which requires an LL(n) grammar.
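In recursive descent each nonterminal becomes one procedure. Here is a sketch in C for the expression part of the grammar above, simplified to work directly on characters (NUM only, no ID), to evaluate as it parses, and to assume well-formed input; all names are my own:

```c
#include <ctype.h>

static const char *cur;                 /* next unconsumed character */

static int expr(void);                  /* forward declaration */

static void skip(void) { while (*cur == ' ') cur++; }

static int factor(void)                 /* factor -> LPAR expr RPAR | NUM */
{
    skip();
    if (*cur == '(') {
        cur++;
        int v = expr();
        skip(); cur++;                  /* consume ')': no error recovery */
        return v;
    }
    int v = 0;
    while (isdigit((unsigned char)*cur)) v = v * 10 + (*cur++ - '0');
    return v;
}

static int term(void)                   /* term -> factor morefactors | SUB term */
{
    skip();
    if (*cur == '-') { cur++; return -term(); }
    int v = factor();
    for (;;) {                          /* morefactors, written as a loop */
        skip();
        if (*cur == '*')      { cur++; v *= factor(); }
        else if (*cur == '/') { cur++; v /= factor(); }
        else return v;
    }
}

static int expr(void)                   /* expr -> term moreterms */
{
    int v = term();
    for (;;) {                          /* moreterms, written as a loop */
        skip();
        if (*cur == '+')      { cur++; v += term(); }
        else if (*cur == '-') { cur++; v -= term(); }
        else return v;
    }
}

int eval(const char *s) { cur = s; return expr(); }
```

Note how one character of lookahead decides which production to use: this is exactly the LL(1) property the grammar was written to have (morefactors/moreterms instead of left recursion).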
Consider again the assignment
position := initial + rate * 60
Does this syntactically legal expression actually mean anything? How do we tell?
The main components of semantic checking are verifying that variables have been declared and type-checking expressions.
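The declaration-checking half revolves around the symbol table: a declaration enters a (name, type) pair, and every later use must find the name. A deliberately naive sketch in C (a real compiler would use hashing or a balanced tree, which is where the O(|C| log |C|) bound in the diagram comes from; the layout and type codes here are illustrative):

```c
#include <string.h>

typedef enum { TY_INT, TY_REAL, TY_UNDECLARED } Type;

/* Fixed-size table and linear scan: fine for a sketch only. */
static struct { const char *name; Type type; } table[100];
static int nsyms = 0;

void declare(const char *name, Type type)
{
    table[nsyms].name = name;
    table[nsyms].type = type;
    nsyms++;
}

Type lookup(const char *name)
{
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].name, name) == 0)
            return table[i].type;
    return TY_UNDECLARED;       /* use before declaration: a semantic error */
}
```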
At the end of this stage we know that we have a legal program and we generate an intermediate representation of the code. This representation may be sufficiently close to machine code that it facilitates the translation to machine code, and yet sufficiently high level that it facilitates optimization. A typical intermediate code is "three address code". Each instruction in this code is a simple assignment consisting of an assigned variable, a binary or unary operation, and its arguments. The above assignment might translate to:
    temp1 := int2real(60)
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1   := temp3
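Three address code of this shape falls out of a post-order walk of the syntax tree, with a fresh temporary naming the result of each operator node. A sketch in C (the node layout and all names are illustrative, not from any particular compiler):

```c
#include <stdio.h>

/* An expression-tree node: op is 'v' for a variable leaf,
   or an operator character such as '+' or '*'. */
typedef struct E { char op; const char *id; struct E *l, *r; } E;

static int ntemp = 0;
static char buf[16][16];        /* names of generated temporaries */

/* Post-order walk: returns the name holding the subtree's value,
   printing one three-address instruction per operator node. */
const char *gen(E *e)
{
    if (e->op == 'v') return e->id;        /* leaf: no code emitted */
    const char *a = gen(e->l);
    const char *b = gen(e->r);
    int t = ntemp++;
    sprintf(buf[t], "temp%d", t + 1);
    printf("%s := %s %c %s\n", buf[t], a, e->op, b);
    return buf[t];
}
```

For initial + rate * 60 this prints one instruction for the multiplication and one for the addition, in exactly the order shown above.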
Modern compilers often use higher-level intermediate languages which have associated transformations which can be reduced to this form.
The intermediate representation is optimized using program transformation techniques (here, constant folding and copy propagation). Applied to the above code, these may have the following effect:
    temp1 := id3 * 60.0
    id1   := id2 + temp1
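The first of these transformations is constant folding: an operation whose arguments are known at compile time is performed at compile time (here, int2real(60) becomes 60.0). A sketch in C over a tiny expression tree; the representation is illustrative:

```c
#include <stdlib.h>

/* op is 'k' (constant leaf), 'v' (variable leaf), or '+'/'*'. */
typedef struct Node {
    char op;
    double k;                 /* value, when op == 'k' */
    struct Node *l, *r;
} Node;

Node *leaf(char op, double k)
{
    Node *n = calloc(1, sizeof *n);
    n->op = op; n->k = k;
    return n;
}

Node *bin(char op, Node *l, Node *r)
{
    Node *n = leaf(op, 0);
    n->l = l; n->r = r;
    return n;
}

/* Fold bottom-up: a subtree collapses to a constant leaf whenever
   both of its children are constants. (Leaks the children: this is
   a sketch, not an implementation.) */
void fold(Node *n)
{
    if (n->op == 'k' || n->op == 'v') return;
    fold(n->l);
    fold(n->r);
    if (n->l->op == 'k' && n->r->op == 'k') {
        n->k = (n->op == '+') ? n->l->k + n->r->k : n->l->k * n->r->k;
        n->op = 'k';
    }
}
```

Copy propagation then replaces uses of a variable that merely copies another value, which is what eliminates temp3 in the code above.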
Intermediate code is translated to a suitable assembler code. Here is the SPARC assembler code for this!
    ld    [%fp-16],%f4
    ld    [%fp-12],%f3
    sethi %hi(.L_cseg2),%l0
    or    %l0,%lo(.L_cseg2),%l0
    ld    [%l0+0],%f2
    fmuls %f3,%f2,%f2
    fadds %f4,%f2,%f2
    st    %f2,[%fp-8]
To obtain this I actually wrote a short C program

    #include <stdio.h>

    int main(void)
    {
        float position, rate, initial;
        initial = 12.0;
        rate = 3;
        position = initial + rate * 60;
        printf("%f %f %f\n", position, initial, rate);
        return 0;
    }
and then compiled it with the -S option, which makes the compiler stop after producing assembler rather than going on to object code. If you want to see how statements should be translated into assembler, this reverse engineering technique is invaluable .... when you get to the code generation stage (next course CPSC 510).
STEP 7: Assembler optimizations
In generating assembler there are a number of important issues:
QUALITIES OF A COMPILER ..