Compilers and How They Work: An Overview

                         Lou Morgan
                Madison IBM-PC User's Group

What  are compilers  and how  do  they work?  Many  computer
users ask themselves this  question  after  the  programming
bug  has  bitten  them.  To most people,  a  compiler  is  a
"black box program" which takes source code  written in some
high-level language,  such as FORTRAN, BASIC, Pascal,  or C,
and translates (compiles)  it into a  language  the computer
can understand and  execute.  Compilers take source code and
transform  it into virtual machine  language.  With the  IBM
PC, this virtual language is 8088 machine code.

Compilers vs. Interpreters

Computers  cannot  understand  English  words  and  grammar.
Even  the   highly   structured   words  and  sentences   of
programming languages  must be translated before  a computer
can understand them.  The compiler  or interpreter must look
up  each "word" of your  programming language  in a  kind of
dictionary  (or  lexicon)  and,  in   a   series  of  steps,
translate  it  into machine  code.  Each  word  initiates  a
separate logical task.

An interpreter translates one line of source code at  a time
into  machine  code, and then  executes  it.  Debugging  and
testing  is  relatively   fast  and  easy   in   interpreted
languages,  since the entire  program  doesn't  have  to  be
reprocessed each  time  a change is  made.  The BASIC(A).COM
program is  an  interpreter.  Interpreted programs run  much
slower  than   compiled  programs,   because  they  must  be
translated each time they  are run.  Programmers  often test
and  debug  their  programs  using an interpreter  and  then
compile them for production use.

How Compilers Work

Most compilers convert programs in three  steps.  Each  step
is  called a  pass.  A  particular  compiler  may  have  one
program  per  pass, or may  combine two or  three steps in a
single program.  For a very complex  language, a step may be
so  difficult to  perform that it  is  broken up  into  many
smaller  steps.  Regardless of how many passes  or  programs
are  required,   the  compiler  performs  only   three  main
functions:   first,   lexical   analysis;   second,   syntax
analysis; and third, code generation.  During  each  pass of
the  compiler,  the source code  moves  closer  to  becoming
virtual machine  language (or whatever language the compiler
is designed to generate).


Lexical Analysis

In the first  pass  of  the  compiler,  the  source  code is
passed  through  a  lexical  analyzer,  which  converts  the
source code  to a set  of  tokens.  A  token  is generally a
number  representing   some  keyword   in  the  language.  A
compiler  has  a  unique number for each keyword  (i.e.  IF,
WHILE, END), and  each  arithmetic or logical operator (i.e.
+,  -, *, AND, OR,  etc.).  Numbers  are  represented  by  a
token  which  indicates  that  what  follows  it  should  be
interpreted  as a number.  The  tokens  put  the programming
language  into  a  form  that  can  be  checked  for  proper
structure and order.

The  other  important  task of the  lexical  analyzer  is to
build  a  symbol  table.  This  is   a   table  of  all  the
identifiers  (variable  names,  procedures,  and  constants)
used   in   the   program.  When   an  identifier  is  first
recognized by the analyzer,  it is inserted into  the symbol
table, along  with information about  its type, where  it is
to be stored, and so  forth.  This information  is  used  in
subsequent passes of the compiler.

Syntax Analysis

After the lexical analyzer translates  a program into tokens
of  keywords,  variables,  constants,  symbols  and  logical
operators, the compiler  makes  its next  pass.  To describe
what happens during  this  function, I will briefly  explain
grammars.

Grammars.  Like  any  language, programming languages have a
set of  rules governing  the structure of the program.  Each
different computer language  has its own grammar which makes
it unique.  Some grammars are complex  (PL/I) and others are
relatively  easy (Pascal).  The  programmer must observe all
the structural rules of  a language to make logical sense to
the computer.  The  next  step  of  the  compiling  process,
parsing, checks to be sure all the rules were followed.

Parsing.  The  parsing routines of a compiler check  to  see
that  the  program  is  written  correctly (according to the
language  rules).  The  parser reads in the tokens generated
by  the  lexical  analyzer and  compares  them  to  the  set
grammar  of  the  programming   language.  If   the  program
follows the rules of  the language, then it is syntactically
correct.  When the parser encounters  a mistake, it issues a
warning  or  error  message  and  tries  to  continue.  Some

parsers  try  to correct a faulty program,  others  do  not.
When the parser  reaches  the  end of the  token  stream, it
will  tell   the   compiler  that  either   the  program  is
grammatically  correct  and  compiling can continue  or  the
program  contains too  many errors  and  compiling  must  be
aborted.  If  the  program  is  grammatically  correct,  the
parser will call for semantic routines.

Semantic Routines.  The  semantic  routines  of  a  compiler
perform two tasks:  checking to make  sure that  each series
of tokens  will  be understood  by the  computer when it  is
fully translated to machine code, and converting the  series
of tokens  one step closer to machine code.  The first  task
takes a  series of tokens,  called  a production, and checks
it to see if  it makes sense.  For example, a production may
be  correct  as  far  as the  parser  is  concerned, but the
semantic routines  check  whether  the variables  have  been
declared,   and   are  of  the   right  type,  etc.  If  the
production  makes sense, the semantic  routine  reduces  the
production  for  the  next   phase   of  compilation,   code
generation.  Most of the  code for the compiler lies here in
the semantic  routines and thus takes  up  a majority of the
compilation time.

Summary.  Two major routines comprise  syntax analysis:  the
parsing  routine  and  the  semantic   routine.  The  parser
checks for the correct  order  of the tokens and then  calls
the  semantic routines to check whether the series of tokens
(a  production)  will   make  sense  to  the  computer.  The
semantic routine  then reduces the production  another  step
toward complete translation to machine code.

Code Generation

The  code generation process determines  how fast  the  code
will run and  how large it will be.  The  first part of code
generation involves  optimization, and  the  second involves
actual machine code generation.

Optimization.  In this step, the compiler  tries to make the
intermediate code  generated by  the  semantic routines more
efficient.  This  process  can be  very  slow and may not be
able  to  improve  the code  much.  Because  of  this,  many
compilers don't include optimizers, and, if  they  do,  they
look only for areas that are easy to optimize.

Code Generation.  This process takes  the intermediate  code
produced  by  the  optimizer  (or semantic routines  if  the
compiler  has no optimizer) and  generates  virtual  machine
code, which in  our  case  is 8088 machine code.  It is this
part  of  the compilation phase  that is machine  dependent.
Each  type  of  computer  has  an   operating   system  that
processes virtual machine code  differently; therefore,  the
code  generator   must  be   different   for  each  type  of
computer.  The  choice  of  instructions  for   the  fastest
execution  and smallest  code  size  are made at this point,
according  to  the  machine's operating  system.  Each  code
generator  is  designed  specifically for  the  machine  and
operating system the final code will run on.

If  the  program  is  free  from  syntactical  errors,  code
generation should take place without  any problem.  When the
code generator  is finished, the  code produced will  be  in
8088 machine code, but the format of  the  code  is  not yet
executable.  It is in  a  format (an .OBJ file in our  case)
that is  ready  to go  to  a  linker,  which  will create an
executable  *.EXE or *.COM file  from the  machine  code the
compiler has generated.
