CCompilerSuite Project Manual

Luca Veraldi
Dept. of Computer Science, University of Pisa
veraldi${@}$di.unipi.it



Abstract:

CCompilerSuite is a complete compiling suite for an extended version of the ANSI standard C language. The suite is composed by several tools: a command-line Compiler which generates intermediate code, named CIL, an Assembler which packs CIL code into its binary representation, a Virtual Machine interpreter for CIL and a rudimental Debugger with common stepping features inside executable code as well as memory/register interospection.

CCompilerSuite has been thought to be an experimental solution, providing skilled users with a basic as well as complete compiler suite, able to be easly extended with more more advanced features, starting from intermediate code optimizations.

The project has been developed in C++; data structures and visitors come directly from the C++ Standard Template Library.






1 General Overview

The CCompilerSuite project was born in 2005 as an experimental compiler for C language, recognized with minor changes and extensions to the original ANSI standard. The experiment principally aims at improving users' skills with assembly language comprehension and manipulation.

The project employs a pure RISC-like assembly, with a register-register architecture and few instructions holding the responsibility for memory accessing and modification. This chosen RISC language, we named CIL, does not have any known real implementation and is only intended to satisfy didactic purposes.

The project is clearly divised into the following tools:

They complete as a whole the compiler suite. The rest of this Manual is organized as follows: Section 2 deals with the specific characteristics and advanced features of the Compiler, describing command-line arguments, structure of intermediate files and analyzing source syntax recongnized by the parser. It also deals with the optmizations the Compiler takes on the Abstract Syntax Tree and with the partial evaluator for constant expressions. Section 3 deals with the interface of the CIL Assembler and describes its input parameters.

Section 4 introduces a simple implementation of the CIL language and describes the interpretation of assembly instructions; finally, Section 5 shows the Debugger interface and describes its commands. Notes about project compiling and licencing eventually end up the documentation.


2 An ANSI-C Compiler

The ANSI-C Compiler is a two-phase language compiler: the Front-end analyzes the source code and produces a minimized Abstract Syntax Tree (AST), with tree node pruning on-the-fly; the Back-end simply traverses the AST and dumps intermediate code instructions in an interpretable format to file.


2.1 Command-line arguments

As shown in Figure 1, the Compiler takes several command-line arguments, usually with the same meanings as in gcc. Users can execute the compiler simply passing the pathname of the C source file to analyze; all other options are not mandatory: they include:

Figure: The Compiler help preamble and command-line options.
\begin{figure}\footnotesize
\begin{verbatim}--== \vert The C CCompiler compi...
... version information
-h print this help\end{verbatim}
\normalsize\end{figure}


2.2 Recognized C dialect

As previously stated, the Compiler recognizes dialect of the C programming language based on the ANSI standard. Minor extension to the original language grammar regards: As it should be clear from these points, the recognized language is pretty similiar to the ANSI standard and changes do not introduce any new expressivity feature. Maybe, it would be worth citing that the Compiler does not recognize GNU C extensions, like label addresses, which always result in parsing errors.


2.3 Abstract Syntax Tree and Code Rewriting

The Compiler is a two-phase code analyzer. In the first phase, the Front-end reads the C source code, recognizes language grammar productions and creates an Abstract Syntax Tree. In this phase, several sub-tree transformation and pruning actions will take place, too. This allows the creation of a minimized tree in a single pass through the source file, gaining in efficiency and compiling completion time. Moreover, a minimized tree results in shorter traversal, while generating intermediate code.

Pruning actions are taken while parsing the source code and are always related to C declaration, expressions or statements which would have no effects for the semantics of the program. Empty declarations; forward type or variable declarations, that are required for correct source parsing but will not end up into the final AST; useless cast expressions; goto statements pointing to the immediately following statement; empty case or default statements in switch blocks and similiar things are typical pruning subjects.

Moreover, pruning is also related to source code rewriting: the Compiler supports several semantics-invariant reconstructions of the original source tree, in a more compact or efficient representation, that always result in better intermediate code performances. Such rewriting rules mostly regard the partial evaluator for algebraic expressions and the C selection statements, e.g. if and switch C constructs. The partial evaluator replaces algebric expressions involving constant operands directly with their constant result, on-the-fly.

If statement rewriting rules regard constructs with a constant predicate: If statements with false predicate are directly replaced with the else branch, if any; when the predicate appears to be always true, the if construct is replaced with the then branch, if any, instead.

Analogous considerations apply to switch blocks: if predicate appears to be constant expression, the switch statement is passed through a partial evaluator which eliminates typical sequence of jumps with the code corresponding to the correct case block, if one exists, of with the default label, preserving semantics of original code. Moreover, switch statements with only a single case or default label are rewritten as simple if statements, preserving original semantics.

Finally, rewriting takes place whenever postfix side effects are present in algebraic expressions; in these cases, the expression will be rewritten in a semantically equivalent way, with all side effects removed (really, postponed). The side effect-related statements will get inserted by the Compiler in an artificial way, at the end of the expression evaluation, in the form of simple increment/decrement statements, preserving standard C evaluation rules.


2.4 Register allocation policy

During the compilation of C sources, the Compiler chooses Virtual Machine registers to save temporary values for assembly instructions. All registers are visible at the assembly level, and are therefore likely to get used by the compiler.

Actually, the register memory is used by the Compiler as a pure associative cache, with a simple LRU-based eviction algorithm. During low level instruction generation, the Compiler will therefore keep track of which symbolic values are stored in which register, within the Virtual Machine.

Whenever a temporary value needs to get reused, the register cache is queried for the requested value: if a register with that value is still available and did not already get evicted, it is used for further expression evaluation. Only in the case the value is not already present in current register snapshot, the corresponding expression requires evaluation and low-level assemble code will be consequently generated. Clearly, this saves the need for additional recomputation of shared sub-expressions in the code.

To grant semantic correctness, the virtual snapshot of Virtual Machine register memory kept by the Compiler is refreshed after every function call and function body compilation.


3 A CIL Assembler

The CIL Assembler is a simple recognizer of CIL source code, that will translate human-readable instructions into binary representation. Particularly, it packs instructions into Virtual Machine words and performs jumping address backpatch. You can modify the output of the Compiler, if you specified the ${-k}$ option, and introduce by hands several optimizations of the CIL assembly code, or do your own experiments from scratch. Specifically, you can edit the .symoblic file and pass it through the Assembler.


3.1 Command-line arguments

Figure 2 shows the command line arguments accepted by the Assembler program. To invoke the tool, simple type in the following command at the command prompt: CilAsm ${source\_file\_pathname}$.

Figure: The Assembler help preamble and command-line options.
\begin{figure}\footnotesize
\begin{verbatim}---== \vert The CIL Assembler v. ...
... version information
-h print this help\end{verbatim}
\normalsize\end{figure}


3.2 CIL Assembly dialect

CIL assembly language is a register-register assembly, with 50 instructions, devided up into 7 categories, w.r.t. their semantics. The language provides for 4 instructions able to directly access/modify program memory. At the CIL level, 64 general-purpose registers are made available, all visible and usable within CIL instructions. Registers are numbered from RG[0] up to RG[63] and organized as follows:

The instruction format is composed by 32 bits, so as to fit a memory word, with no other payloads. The bits are organized as follows (Figure 3):

Instructions requiring less that three arguments obey to a sligthly different format. Test instructions (e.g. IFEQ, IFSLT) take only two arguments and a virtual address, or relative offeset, of the jump target. This constant can use all the remaining bits (therefore, up to 12 bits). The MOV instruction can move a constant to one of the Virtual Machine general-purpose registers. This constant can use all the remaining bits (therefore, up to 18 bits). Finally, the unconditional jump instruction (GOTO), apart from the Operative Code and control bits, uses all the remaining bits (therefore, 24 bits) to express the virtual address of jump target or relative offset. Figure 3 explains the different instruction layouts.

Figure: The CIL instruction format.
\begin{figure}\footnotesize
\begin{verbatim}General format:
[ CODE ] [ ARG MO...
...------------------------------->
6 2 24\end{verbatim}
\normalsize\end{figure}

We identify the following instruction categories:

For instructions with less than three arguments, the Operative Code bits assume a quite special meaning, as reported in Figure 3: bits number 0 and 1 are always set to 1; bits number 2 and 3 represent the number of arguments the instruction will take (00 for no arguments at all, 01 for only one argument and 1- for more arguments); remaining bits simply distinguish instructions inside their category.

Additional control bits specify the interpretation of arguments inside Virtual Machine: specifically, they identify arguments as being either constant values or indexes for general-purpose registers.

For instruction taking three arguments, second and third arguments can be either a signed integer constant or a register index. So, constants which can be expressed as payload arguments in such instructions range from ${-2^{5}}$ to ${2^{5}-1}$.

Test instructions can jump to either a relative, signed offset constant, therefore ranging from ${-2^{11}}$ to ${2^{11}-1}$, or to the relative signed offset stored in one of the Virtual Machine general-purpose registers.

Moreover, we can move integer constants into Virtual Machine registers with the MOV instruction, if constants range from ${-2^{17}}$ to ${2^{17}-1}$. Otherwise, register-to-regsiter copy is required.

Finally, the unconditional jump instruction can jump to a relative, signed offset constant, therefore ranging from ${-2^{23}}$ to ${2^{23}-1}$, or to the relative signed offset stored in one of the Virtual Machine general-purpose registers. Moreover, bare goto statements in source can be translated as CIL absolute jump instructions, i.e. relative to the 0 initial virtual address generated by the Compiler.



Category Code Mask Instructions
Short math 000 -- ADD SUB SHL SHR
Long math 001 -- MUL DIV MOD SQRT POW LOG
Comparison 010 --
IFSLT IFSGT IFSGE IFSLE (signed)
IFULT IFUGT IFUGE IFULE (unsined)
Bit-to-bit 011 -- IFEQ IFNE AND OR XOR
Memory 100 -- LOAD LOADS LOADB STORE STORES STOREB
Trigonometry 101 -- SIN COS TAN ASN ACS ATN
Others
11 00 -
11 01 -
11 10 -
11 11 -
NOP INT RET END (no arguments)
PUSH POP GOTO CALL (one argument)
MOV NOT BIT0 SIGN (two arguments)
HIGH LOW NEG (two arguments)


3.3 Backpatching algorithm

The CIL instruction format imposes several limitations upon constant jump values, both in conditional branches as well as unconditional jumps, as can be derived from previous discussion; specifically: These instruction classes will be referred to as a whole with the generic terms of jumps or jump instructions, in the rest.

These considerations raise problems, when backpatching jumps: whenever an unknown address need to be used in a earlier place in code, we remember the place the instruction occurred at and the target address it depends on. As soon as this address will be generated by the Compiler, e.g. because we eventually encounter the function implementation we previously called, all instructions depending on this address need to get backpatched with the actual constant. The constant has to be either the absolute virtual address of the required instruction or a relative offset from the referring one in code.

If such constant does not fit the instruction format, it needs to be obtained by some arithmetic manipulations upon registers.

Backpatch is not a novel concept, in compiler theory: here we introduce a backpatching algorithm for our pure RISC-like assembly language. The algorithm clearly shows two phases: in the first one, a least fixed-point is computed, upon constant jump values; in the second one, instructions requiring backpatching are actually fixed with the correct constant, if it fits the instruction format, or required code is properly added before it, to obtain the constant as well.

Figure 4 shows involed data structures. During intermediate code generation, in the Compiler back-end, we collect a list of instructions requiring backpatching, i.e referring to not-yet-generated virtual addresses, and also collect for each instruction the virtual address it appeared at (required to compute the relative offset in code, as required).

Figure: Backpatching algorithm: data structures involved. (a) is the main list of instructions that require being backpatched, with related addresses they appeared at, in the code; (b) is the code of instructions where jump constant does not fit instruction format and will be used in the second algorithm phase; (c) is the automatically rebalancing, binary serach tree, to cope with dynamic adding of instructions in code.
\includegraphics[scale=0.25]{ds.eps}

At the end of the intermediate code generation phase, the backpatched algorithm is run: for each collected instruction in the list, we compute source address (the address it appeared at) and destination one (i.e. the address of the required jump target). Then we compute the jump constant, depending on the instruction type and decide weather or not the constant fit the instruction format, w.r.t. the limits exposed previously. If constant does not, we have to obtain it in an alternative way. Usually, a MOV instruction will resolve the problem, allowing to cope with constants in a wider range, than the one allowed for conditional jumps. Otherwise, elementary arithmetic on constant and registers is used.

If the constant does not fit the instruction format, additional instructions need to be added before the one being backpatched: either a single MOV instruction, or several arithmetic instructions. Therefore, while executing the algorithm, we simulate the code completion and take into account additional instructions to be generated.

With special reference to loop constructs, the addition of one or more instructions to the generated code may invalidate previously computed jump constants. Moreover, the same activity will shift down virtual addresses for functions and labeled statements, influencing call instructions and forced GOTO, as well. For this reason, a least fixed-point computation is involved.

To take into account the addition of new instructions to the one already prepared at the end of Compiler back-end phase, we use an AVL binary search tree, i.e. an automatically rebalancing tree, where each node contains the accumulated number of instructions added up to that point in code. This values are used when computing relative offsets and absolute virtual addresses for jumps, being the original ones plus the accumulated displacement found in the correct position of the AVL tree.

At the end of the least fixed-point computation, the second phase of the algorithm starts, where binary instruction formats actually get backpatched and required additional instructions added to the code, in the correct place.

Figure: Backpatching algorithm: pseudocode.
\begin{figure}\footnotesize
\begin{verbatim}/* Backpatching algorithm for the ...
...ions( requiredInstructions(jump) );
}
}\end{verbatim}
\normalsize\end{figure}


3.4 Binary executable format

Users can generate binary executables in two different ways: either writing down ANSI-C sources and then compiling them with the CCompiler; or editing a text file with CIL instructions and then passing it through the CIL Assembler.

Either way, if successful, the binary file produced can be directly executed by the CIL Virtual Machine, which represents an implementation of the assembly dialect and can, therefore, interpret the binary instruction formats.

Binary files have the following layout:


4 A RISC Virtual Machine

The Virtual Machine tool is an interpreter for the chosen RISC-like assembly language. It can be used to execute the binary code generated by the Compiler. The Virtual Machine also exports operations over the binary code which will be exploited by the assembly-level Debugger, as we will see later on in Section 5.


4.1 Command-line arguments

Figure 5 shows the command line arguments accepted by the Virtual Machine program. To execute a binary file produced by the compiler, simple type in the following command at the command prompt: CilVM ${binary\_file\_pathname}$.

Figure: The Virtual Machine help preamble and command-line options.
\begin{figure}\footnotesize
\begin{verbatim}---== \vert The CIL Virtual Machi...
... version information
-h print this help\end{verbatim}
\normalsize\end{figure}


4.2 CIL Stack

The Cil Virtual Machine maintains an internal program stack, which layout is represented in Figure 6. Upon each function call, caller will push actual parameters onto the stack, in reverse order, as from C-calling conventions. On top of the stack will stay the return value of the function: local variable allocation can make the stack growth further.

On function return, it is at the caller's reponsibility to restore the stack pointer to its correct position in virtual memory, eventually popping function result, as needed, and assigning the stack frame pointer the original value it had before the call took place.

Figure: The CIL stack layout
\begin{figure}\centering
\footnotesize
\begin{verbatim}----------------
\vert...
... ... \vert
\vert \vert
----------------\end{verbatim}
\normalsize\end{figure}

The intermediate assembly language provide for two instructions, directly managing the stack: PUSH and POP. An reserved integer Virtual Machine register, RG[stack], is always allocated to contain the current pointer to the first free byte in the stack. These instructions are actually only mnemonic shortcuts for STORE and LOAD, respectively, being the only memory-register class instructions. The different operative codes for stack-related instructions allow for fewer parameters, in instruction format (address of the stack is fixed and implicitly obtained by the value of the RG[stack] register).

Moreover, the interpreted program can directly modify the stack pointer, with usual register manipulation instructions, therefore reserving or releasing byte blocks of stack memory with simple ADD or SUB instructions. The Virtual Machine will treat the interpreted program's stack a a vector of byte, being the value of RG[stack] simply the pointer of &stack[0]. An initial default stack size is reserved, upon program interpretation startup.

On the top of the stack, it is logically positioned the heap allocation area, managed through conventional call to malloc and free runtime support functions. The interpreted program and its Virtual Machine actually share their virtual address space: therefore, the heap of the interpreted program is shared with the one of the Virtual Machine itself.


5 A RISC Debugger

Coming soon...


6 Compiling and Distributing the Project

The project does not come with a standard configure utility. Instead, a Makefile is provided, in the project root directory. It contains all the information pieces required to correctly build and install the solution.

The provided Makefile allows to build two different configurations of the project: release and debug. Debugging configuration has been useful, during feature implementation phase, and maybe it is of any interest for final users, any more.

The release option, instead, is the configuration which enables all solution code optimizations and produces highly lightweight and efficient executables for the suite tools. Compiling in release configuration may take a while, due to the heavy employment of inlining features in solution codes, and to the compiling options, which include several expensive assembly-level optimizations.

To compile the suite, just type the following command at the command prompt: make release install. The generation of this project documentation is achieved through make doc, instead.

After the compiling phase successful termination, in the bin/ directory under the project root will appear three links to the suite tool binaries: CCompiler for the Compiler, CilVM for the CIL Virtual Machine and CilDbg for the CIL assembly-level Debugger.


6.1 Project dependencies

Project build requires several libraries and tools to be already installed on the target machine, in order to correctly link the final executables. These software pieces are:

Make sure to have all the tools and libraries required correctly installed, before compiling the suite.


6.2 Using the Code

CCompilerSuite is distributed under the GNU General Public Licence (GPL). Modifying, distributing and copying of this software or part of it are subject to the GPL terms and conditions of use.

Copyright (C) 2006, Luca Veraldi

This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published
by the Free Software Foundation; either version 2 of the License,
or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty
of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

About this document ...

CCompilerSuite Project Manual

This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.71)

Copyright © 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999, Ross Moore, Mathematics Department, Macquarie University, Sydney.

The command line arguments were:
latex2html CCompilerSuiteDoc.tex -split 2 -nonavigation -notop_navigation -nobottom_navigation -noauto_navigation -noindex_in_navigation -antialias_text -transparent -white -show_section_numbers -show_init

The translation was initiated by Luca Veraldi on 2006-04-16


Luca Veraldi 2006-04-16