Compilers and How They Work: An Overview Lou Morgan Madison IBM-PC User's Group What are compilers and how do they work? Many computer users ask themselves this question after the programming bug has bitten them. To most people, a compiler is a "black box program" which takes source code written in some high-level language, such as FORTRAN, BASIC, Pascal, or C, and translates (compiles) it into a language the computer can understand and execute. Compilers take source code and transform it into virtual machine language. With the IBM PC, this virtual language is 8088 machine code. Compilers vs. Interpreters Computers cannot understand English words and grammar. Even the highly structured words and sentences of programming languages must be translated before a computer can understand them. The compiler or interpreter must look up each "word" of your programming language in a kind of dictionary (or lexicon) and, in a series of steps, translate it into machine code. Each word initiates a separate logical task. An interpreter translates one line of source code at a time into machine code, and then executes it. Debugging and testing is relatively fast and easy in interpreted languages, since the entire program doesn't have to be reprocessed each time a change is made. The BASIC(A).COM program is an interpreter. Interpreted programs run much slower than compiled programs, because they must be translated each time they are run. Programmers often test and debug their programs using an interpreter and then compile them for production use. How Compilers Work Most compilers convert programs in three steps. Each step is called a pass. A particular compiler may have one program per pass, or may combine two or three steps in a single program. For a very complex language, a step may be so difficult to perform that it is broken up into many smaller steps. Regardless of how many passes or programs are required, the compiler performs only three main functions: first, lexical analysis; second, syntax analysis; and third, code generation. During each pass of the compiler, the source code moves closer to becoming virtual machine language (or whatever language the compiler is designed to generate). Lexical Analysis In the first pass of the compiler, the source code is passed through a lexical analyzer, which converts the source code to a set of tokens. A token is generally a number representing some keyword in the language. A compiler has a unique number for each keyword (i.e. IF, WHILE, END), and each arithmetic or logical operator (i.e. +, -, *, AND, OR, etc.). Numbers are represented by a token which indicates that what follows it should be interpreted as a number. The tokens put the programming language into a form that can be checked for proper structure and order. The other important task of the lexical analyzer is to build a symbol table. This is a table of all the identifiers (variable names, procedures, and constants) used in the program. When an identifier is first recognized by the analyzer, it is inserted into the symbol table, along with information about its type, where it is to be stored, and so forth. This information is used in subsequent passes of the compiler. Syntax Analysis After the lexical analyzer translates a program into tokens of keywords, variables, constants, symbols and logical operators, the compiler makes its next pass. To describe what happens during this function, I will briefly explain grammars. Grammars. Like any language, programming languages have a set of rules governing the structure of the program. Each different computer language has its own grammar which makes it unique. Some grammars are complex (PL/I) and others are relatively easy (Pascal). The programmer must observe all the structural rules of a language to make logical sense to the computer. The next step of the compiling process, parsing, checks to be sure all the rules were followed. Parsing. The parsing routines of a compiler check to see that the program is written correctly (according to the language rules). The parser reads in the tokens generated by the lexical analyzer and compares them to the set grammar of the programming language. If the program follows the rules of the language, then it is syntactically correct. When the parser encounters a mistake, it issues a warning or error message and tries to continue. Some parsers try to correct a faulty program, others do not. When the parser reaches the end of the token stream, it will tell the compiler that either the program is grammatically correct and compiling can continue or the program contains too many errors and compiling must be aborted. If the program is grammatically correct, the parser will call for semantic routines. Semantic Routines. The semantic routines of a compiler perform two tasks: checking to make sure that each series of tokens will be understood by the computer when it is fully translated to machine code, and converting the series of tokens one step closer to machine code. The first task takes a series of tokens, called a production, and checks it to see if it makes sense. For example, a production may be correct as far as the parser is concerned, but the semantic routines check whether the variables have been declared, and are of the right type, etc. If the production makes sense, the semantic routine reduces the production for the next phase of compilation, code generation. Most of the code for the compiler lies here in the semantic routines and thus takes up a majority of the compilation time. Summary. Two major routines comprise syntax analysis: the parsing routine and the semantic routine. The parser checks for the correct order of the tokens and then calls the semantic routines to check whether the series of tokens (a production) will make sense to the computer. The semantic routine then reduces the production another step toward complete translation to machine code. Code Generation The code generation process determines how fast the code will run and how large it will be. The first part of code generation involves optimization, and the second involves actual machine code generation. Optimization. In this step, the compiler tries to make the intermediate code generated by the semantic routines more efficient. This process can be very slow and may not be able to improve the code much. Because of this, many compilers don't include optimizers, and, if they do, they look only for areas that are easy to optimize. Code Generation. This process takes the intermediate code produced by the optimizer (or semantic routines if the compiler has no optimizer) and generates virtual machine code, which in our case is 8088 machine code. It is this part of the compilation phase that is machine dependent. Each type of computer has an operating system that processes virtual machine code differently; therefore, the code generator must be different for each type of computer. The choice of instructions for the fastest execution and smallest code size are made at this point, according to the machine's operating system. Each code generator is designed specifically for the machine and operating system the final code will run on. If the program is free from syntactical errors, code generation should take place without any problem. When the code generator is finished, the code produced will be in 8088 machine code, but the format of the code is not yet executable. It is in a format (an .OBJ file in our case) that is ready to go to a linker, which will create an executable *.EXE or *.COM file from the machine code the compiler has generated.