CCompilerSuite Project Manual

Luca Veraldi

Dept. of Computer Science, University of Pisa

veraldidi.unipi.it

Abstract:

CCompilerSuite is a complete compiling suite for an extended version of the ANSI standard C language. The suite is composed by several tools: a command-line Compiler which generates intermediate code, named CIL, an Assembler which packs CIL code into its binary representation, a Virtual Machine interpreter for CIL and a rudimental Debugger with common stepping features inside executable code as well as memory/register interospection.

CCompilerSuite has been thought to be an experimental solution, providing skilled users with a basic as well as complete compiler suite, able to be easly extended with more more advanced features, starting from intermediate code optimizations.

The project has been developed in C++; data structures and visitors come directly from the C++ Standard Template Library.

1 General Overview

The CCompilerSuite project was born in 2005 as an experimental compiler for C language, recognized with minor changes and extensions to the original ANSI standard. The experiment principally aims at improving users' skills with assembly language comprehension and manipulation.

The project employs a pure RISC-like assembly, with a register-register architecture and few instructions holding the responsibility for memory accessing and modification. This chosen RISC language, we named CIL, does not have any known real implementation and is only intended to satisfy didactic purposes.

The project is clearly divised into the following tools:

an ANSI-C Compiler, which is a simple and efficient command-line compiler for standard C language;
a CIL Assembler, which takes a human-readable CIL source code and packs it into its binary representation, executable by the Virtual Machine;
a Virtual Machine for the CIL intermediate language, which can interpret and execute the generated binary code
a rudimental assembly-level Debugger, which is command-line debugging tool, for unexperienced dummies.

They complete as a whole the compiler suite. The rest of this Manual is organized as follows: Section 2 deals with the specific characteristics and advanced features of the Compiler, describing command-line arguments, structure of intermediate files and analyzing source syntax recongnized by the parser. It also deals with the optmizations the Compiler takes on the Abstract Syntax Tree and with the partial evaluator for constant expressions. Section 3 deals with the interface of the CIL Assembler and describes its input parameters.

Section 4 introduces a simple implementation of the CIL language and describes the interpretation of assembly instructions; finally, Section 5 shows the Debugger interface and describes its commands. Notes about project compiling and licencing eventually end up the documentation.

2 An ANSI-C Compiler

The ANSI-C Compiler is a two-phase language compiler: the Front-end analyzes the source code and produces a minimized Abstract Syntax Tree (AST), with tree node pruning on-the-fly; the Back-end simply traverses the AST and dumps intermediate code instructions in an interpretable format to file.

2.1 Command-line arguments

As shown in Figure 1, the Compiler takes several command-line arguments, usually with the same meanings as in gcc. Users can execute the compiler simply passing the pathname of the C source file to analyze; all other options are not mandatory: they include:

**Figure:** The Compiler help preamble and command-line options.
$\begin{figure}\footnotesize \begin{verbatim}--== \vert The C CCompiler compi... ... version information -h print this help\end{verbatim} \normalsize\end{figure}$

${-o <file\_out>}$ , to specify the pathname of the binary executable to create, in case of successfull compilation.
Equavalent to gcc option;
, to enable compiler optimizations on intermediate code. Several, more trivial optimization techniques are actually always enabled by default, when compiling the code. This option will be related to more advanced and, maybe, expensive ones, with the possibility to also specify the amount of time to spent during optimization phase.
This is an experimental feature not yet available, at the current developing status of the project;
is used to enable several warnings about potentially wrong constructs in source code or programming techniques which will result in dangerous code, like returning addess of locally alocated variables or missing a return statement in some of the code branches for a function returning non-void.
This option will also enable lots of warning regarding extensions to usual ANSI-C syntax, like return type defaulting to int for functions declarations.
Several warnings, considered particularly dangerous, are always enabled, no matter the option is given or not: these warnings are most related to the partial evaluator engine, to mismatching parameter number for functions or const qualifier discarding in assignments.
Equavalent to gcc options ;
, to invoke the Compiler Front-end, alone, without generating executables. This only checks for syntax to be correct.
Equivalent to gcc option;
is used to save temporary compiling files to disk. The Compiler supports dumping of three kinds of temporary files: an high-level Abstract Syntax Tree representation, in XML format, the rawly generated intermediate code, if no option has been given, that can be modified by skilled users and passed through the Assembler to produce a new binary executable, and the final backpatched code, that can be used within the assembly-level Debugger. Files are all in human-readable format and have extensions .xml, .symbolic and .cil, respectively.
${-d <expr\vert decl\vert stmnt\vert all>}$ prints to screen the grammar used by the CCompiler parser during source code analysis. Users can specify which branch of the grammar to be shown, choosing among expression productions (), declarations (), statements and code blocks (), or the entire syntax grammar ().
prints the Compiler version, currently 1.0.0.
Equivalent to gcc option;
prints the help reported in Figure.

2.2 Recognized C dialect

As previously stated, the Compiler recognizes dialect of the C programming language based on the ANSI standard. Minor extension to the original language grammar regards:

old-style function parameter declaration: the Compiler allows syntactic sugar like the one presented here:
int function(a, b) int a; char* b; { <body> }
instead of
int function(int a, char* b) { <body> }
default type for declarations: the Compiler allows declarations not to have a type, which will default to signed int;
empty aggregated types: the Compiler does not allow for structures as well as unions to have no fields;
arrow operator: the Compiler allows syntactic sugar inherited from C++, like the one presented here:
... ptr->field ...
instead of
... (*ptr).field ...
enumeration: the Compiler allows identifier lists for enumerations to end with a comma, before the closing brace.

As it should be clear from these points, the recognized language is pretty similiar to the ANSI standard and changes do not introduce any new expressivity feature. Maybe, it would be worth citing that the Compiler does not recognize GNU C extensions, like label addresses, which always result in parsing errors.

2.3 Abstract Syntax Tree and Code Rewriting

The Compiler is a two-phase code analyzer. In the first phase, the Front-end reads the C source code, recognizes language grammar productions and creates an Abstract Syntax Tree. In this phase, several sub-tree transformation and pruning actions will take place, too. This allows the creation of a minimized tree in a single pass through the source file, gaining in efficiency and compiling completion time. Moreover, a minimized tree results in shorter traversal, while generating intermediate code.

Pruning actions are taken while parsing the source code and are always related to C declaration, expressions or statements which would have no effects for the semantics of the program. Empty declarations; forward type or variable declarations, that are required for correct source parsing but will not end up into the final AST; useless cast expressions; goto statements pointing to the immediately following statement; empty case or default statements in switch blocks and similiar things are typical pruning subjects.

Moreover, pruning is also related to source code rewriting: the Compiler supports several semantics-invariant reconstructions of the original source tree, in a more compact or efficient representation, that always result in better intermediate code performances. Such rewriting rules mostly regard the partial evaluator for algebraic expressions and the C selection statements, e.g. if and switch C constructs. The partial evaluator replaces algebric expressions involving constant operands directly with their constant result, on-the-fly.

If statement rewriting rules regard constructs with a constant predicate: If statements with false predicate are directly replaced with the else branch, if any; when the predicate appears to be always true, the if construct is replaced with the then branch, if any, instead.

Analogous considerations apply to switch blocks: if predicate appears to be constant expression, the switch statement is passed through a partial evaluator which eliminates typical sequence of jumps with the code corresponding to the correct case block, if one exists, of with the default label, preserving semantics of original code. Moreover, switch statements with only a single case or default label are rewritten as simple if statements, preserving original semantics.

Finally, rewriting takes place whenever postfix side effects are present in algebraic expressions; in these cases, the expression will be rewritten in a semantically equivalent way, with all side effects removed (really, postponed). The side effect-related statements will get inserted by the Compiler in an artificial way, at the end of the expression evaluation, in the form of simple increment/decrement statements, preserving standard C evaluation rules.

2.4 Register allocation policy

During the compilation of C sources, the Compiler chooses Virtual Machine registers to save temporary values for assembly instructions. All registers are visible at the assembly level, and are therefore likely to get used by the compiler.

Actually, the register memory is used by the Compiler as a pure associative cache, with a simple LRU-based eviction algorithm. During low level instruction generation, the Compiler will therefore keep track of which symbolic values are stored in which register, within the Virtual Machine.

Whenever a temporary value needs to get reused, the register cache is queried for the requested value: if a register with that value is still available and did not already get evicted, it is used for further expression evaluation. Only in the case the value is not already present in current register snapshot, the corresponding expression requires evaluation and low-level assemble code will be consequently generated. Clearly, this saves the need for additional recomputation of shared sub-expressions in the code.

To grant semantic correctness, the virtual snapshot of Virtual Machine register memory kept by the Compiler is refreshed after every function call and function body compilation.

3 A CIL Assembler

The CIL Assembler is a simple recognizer of CIL source code, that will translate human-readable instructions into binary representation. Particularly, it packs instructions into Virtual Machine words and performs jumping address backpatch. You can modify the output of the Compiler, if you specified the

option, and introduce by hands several optimizations of the CIL assembly code, or do your own experiments from scratch. Specifically, you can edit the .symoblic file and pass it through the Assembler.

3.1 Command-line arguments

Figure 2 shows the command line arguments accepted by the Assembler program. To invoke the tool, simple type in the following command at the command prompt: CilAsm.

**Figure:** The Assembler help preamble and command-line options.
$\begin{figure}\footnotesize \begin{verbatim}---== \vert The CIL Assembler v. ... ... version information -h print this help\end{verbatim} \normalsize\end{figure}$

${-o <file\_out>}$ , to specify the pathname of the binary executable to create, in case of successfull assembling.
, to invoke the Assembler Front-end, alone, without generating executables. This only checks for syntax to be correct.
is used to save the temporary file to disk. The Assembler supports dumping of the final backpatched file, if no option has been given, that can be used within the assembly-level Debugger. The file is in human-readable format and has .cil extension.
prints the Assembler version, currently 1.0.0.
prints the help reported in Figure.

3.2 CIL Assembly dialect

CIL assembly language is a register-register assembly, with 50 instructions, devided up into 7 categories, w.r.t. their semantics. The language provides for 4 instructions able to directly access/modify program memory. At the CIL level, 64 general-purpose registers are made available, all visible and usable within CIL instructions. Registers are numbered from RG[0] up to RG[63] and organized as follows:

32 integer registers, ranging from RG[0] up to RG[31]. Registers in range 21-23 inclusive are reserved by Virtual Machine; specifically:
- RG[21] also named RG[zero] contains the constant 0;
- RG[22] also named RG[stack] contains the stack pointer;
- RG[23] also named RG[PCB] constaint the PCB pointer;
32 floating-point registers, ranging from RG[32] up to RG[63].

The instruction format is composed by 32 bits, so as to fit a memory word, with no other payloads. The bits are organized as follows (Figure 3):

6 bits for the Operative Code of the instruction. These bits clearly identify the instruction category (most significant 3 bits) and the type of the instruction inside its category (remaining 3 bits);
2 control bits, to specify operand interpretation inside Virtual Machine;
6 bits for each of the three operands of the instruction;
6 bits reserved for future use and Virtual Machine extensions.

Instructions requiring less that three arguments obey to a sligthly different format. Test instructions (e.g. IFEQ, IFSLT) take only two arguments and a virtual address, or relative offeset, of the jump target. This constant can use all the remaining bits (therefore, up to 12 bits). The MOV instruction can move a constant to one of the Virtual Machine general-purpose registers. This constant can use all the remaining bits (therefore, up to 18 bits). Finally, the unconditional jump instruction (GOTO), apart from the Operative Code and control bits, uses all the remaining bits (therefore, 24 bits) to express the virtual address of jump target or relative offset. Figure 3 explains the different instruction layouts.

**Figure:** The CIL instruction format.
$\begin{figure}\footnotesize \begin{verbatim}General format: [ CODE ] [ ARG MO... ...-------------------------------> 6 2 24\end{verbatim} \normalsize\end{figure}$

We identify the following instruction categories:

short math, for simple and most common algebra;
long math, including multiplication and logarithm;
comparisons, divided into comparisons for signed and unsigned;
bit operations, using arguments as bit masks;
basic trigonometric functions and their inverses;
memory access/modify, the only instructions accessing memory;
instructions with less than three arguments.

For instructions with less than three arguments, the Operative Code bits assume a quite special meaning, as reported in Figure 3: bits number 0 and 1 are always set to 1; bits number 2 and 3 represent the number of arguments the instruction will take (00 for no arguments at all, 01 for only one argument and 1- for more arguments); remaining bits simply distinguish instructions inside their category.

Additional control bits specify the interpretation of arguments inside Virtual Machine: specifically, they identify arguments as being either constant values or indexes for general-purpose registers.

For instruction taking three arguments, second and third arguments can be either a signed integer constant or a register index. So, constants which can be expressed as payload arguments in such instructions range from ${-2^{5}}$ to ${2^{5}-1}$ .

Test instructions can jump to either a relative, signed offset constant, therefore ranging from ${-2^{11}}$ to ${2^{11}-1}$ , or to the relative signed offset stored in one of the Virtual Machine general-purpose registers.

Moreover, we can move integer constants into Virtual Machine registers with the MOV instruction, if constants range from ${-2^{17}}$ to ${2^{17}-1}$ . Otherwise, register-to-regsiter copy is required.

Finally, the unconditional jump instruction can jump to a relative, signed offset constant, therefore ranging from ${-2^{23}}$ to ${2^{23}-1}$ , or to the relative signed offset stored in one of the Virtual Machine general-purpose registers. Moreover, bare goto statements in source can be translated as CIL absolute jump instructions, i.e. relative to the 0 initial virtual address generated by the Compiler.

Category Code Mask Instructions

Short math 000 -- ADD SUB SHL SHR

Long math 001 -- MUL DIV MOD SQRT POW LOG

Comparison

010 --

`IFSLT IFSGT IFSGE IFSLE`	(`signed`)
`IFULT IFUGT IFUGE IFULE`	(`unsined`)

Bit-to-bit 011 -- IFEQ IFNE AND OR XOR

Memory 100 -- LOAD LOADS LOADB STORE STORES STOREB

Trigonometry 101 -- SIN COS TAN ASN ACS ATN

Others

11 00 -

11 01 -

11 10 -

11 11 -

`NOP INT RET END`	(no arguments)
`PUSH POP GOTO CALL`	(one argument)
`MOV NOT BIT0 SIGN`	(two arguments)
`HIGH LOW NEG`	(two arguments)

3.3 Backpatching algorithm

The CIL instruction format imposes several limitations upon constant jump values, both in conditional branches as well as unconditional jumps, as can be derived from previous discussion; specifically:

in conditional branches, jump constant must be limited in value from ${-2^{11}}$ to ${2^{11}-1}$ , and is always generated as a relative offset from current instruction address;
when initializing pointer-to-function variables with the name of a declared function, a MOV instruction is generated, where the constant moved is the absolute virtual address of the function first instruction, and constant must be limited in value from to ${2^{17}-1}$ ;
in unconditional jumps, the constant must be limited in value from ${-2^{23}}$ to ${2^{23}-1}$ , and is always generated as a relative offset from current instruction address;
in function calls, the constant value is the absolute virtual address of the function first instruction, and obeys to the same limitations as in a bare GOTO jump constant, so ranging from up to ${2^{23}-1}$ ;

These instruction classes will be referred to as a whole with the generic terms of jumps or jump instructions, in the rest.

These considerations raise problems, when backpatching jumps: whenever an unknown address need to be used in a earlier place in code, we remember the place the instruction occurred at and the target address it depends on. As soon as this address will be generated by the Compiler, e.g. because we eventually encounter the function implementation we previously called, all instructions depending on this address need to get backpatched with the actual constant. The constant has to be either the absolute virtual address of the required instruction or a relative offset from the referring one in code.

If such constant does not fit the instruction format, it needs to be obtained by some arithmetic manipulations upon registers.

Backpatch is not a novel concept, in compiler theory: here we introduce a backpatching algorithm for our pure RISC-like assembly language. The algorithm clearly shows two phases: in the first one, a least fixed-point is computed, upon constant jump values; in the second one, instructions requiring backpatching are actually fixed with the correct constant, if it fits the instruction format, or required code is properly added before it, to obtain the constant as well.

Figure 4 shows involed data structures. During intermediate code generation, in the Compiler back-end, we collect a list of instructions requiring backpatching, i.e referring to not-yet-generated virtual addresses, and also collect for each instruction the virtual address it appeared at (required to compute the relative offset in code, as required).

**Figure:** Backpatching algorithm: data structures involved. (a) is the main list of instructions that require being backpatched, with related addresses they appeared at, in the code; (b) is the code of instructions where jump constant does not fit instruction format and will be used in the second algorithm phase; (c) is the automatically rebalancing, binary serach tree, to cope with dynamic adding of instructions in code.
$\includegraphics[scale=0.25]{ds.eps}$

At the end of the intermediate code generation phase, the backpatched algorithm is run: for each collected instruction in the list, we compute source address (the address it appeared at) and destination one (i.e. the address of the required jump target). Then we compute the jump constant, depending on the instruction type and decide weather or not the constant fit the instruction format, w.r.t. the limits exposed previously. If constant does not, we have to obtain it in an alternative way. Usually, a MOV instruction will resolve the problem, allowing to cope with constants in a wider range, than the one allowed for conditional jumps. Otherwise, elementary arithmetic on constant and registers is used.

If the constant does not fit the instruction format, additional instructions need to be added before the one being backpatched: either a single MOV instruction, or several arithmetic instructions. Therefore, while executing the algorithm, we simulate the code completion and take into account additional instructions to be generated.

With special reference to loop constructs, the addition of one or more instructions to the generated code may invalidate previously computed jump constants. Moreover, the same activity will shift down virtual addresses for functions and labeled statements, influencing call instructions and forced GOTO, as well. For this reason, a least fixed-point computation is involved.

To take into account the addition of new instructions to the one already prepared at the end of Compiler back-end phase, we use an AVL binary search tree, i.e. an automatically rebalancing tree, where each node contains the accumulated number of instructions added up to that point in code. This values are used when computing relative offsets and absolute virtual addresses for jumps, being the original ones plus the accumulated displacement found in the correct position of the AVL tree.

At the end of the least fixed-point computation, the second phase of the algorithm starts, where binary instruction formats actually get backpatched and required additional instructions added to the code, in the correct place.

**Figure:** Backpatching algorithm: pseudocode.
$\begin{figure}\footnotesize \begin{verbatim}/* Backpatching algorithm for the ... ...ions( requiredInstructions(jump) ); } }\end{verbatim} \normalsize\end{figure}$

3.4 Binary executable format

Users can generate binary executables in two different ways: either writing down ANSI-C sources and then compiling them with the CCompiler; or editing a text file with CIL instructions and then passing it through the CIL Assembler.

Either way, if successful, the binary file produced can be directly executed by the CIL Virtual Machine, which represents an implementation of the assembly dialect and can, therefore, interpret the binary instruction formats.

Binary files have the following layout:

an 8-byte-long preamble, containing two integer values with the byte size of the bss segment and of the text segment;
the bss segment, with three distinct portions: a fixed-size PCB, with process priority, initial register values, program counter with virtual address of main entry point and the virtual address of process relocation table; the process relocation table; the constant table;
the text segment, with the program CIL instructions.

4 A RISC Virtual Machine

The Virtual Machine tool is an interpreter for the chosen RISC-like assembly language. It can be used to execute the binary code generated by the Compiler. The Virtual Machine also exports operations over the binary code which will be exploited by the assembly-level Debugger, as we will see later on in Section 5.

4.1 Command-line arguments

Figure 5 shows the command line arguments accepted by the Virtual Machine program. To execute a binary file produced by the compiler, simple type in the following command at the command prompt: CilVM.

**Figure:** The Virtual Machine help preamble and command-line options.
$\begin{figure}\footnotesize \begin{verbatim}---== \vert The CIL Virtual Machi... ... version information -h print this help\end{verbatim} \normalsize\end{figure}$

${-s <stack_size}$ , to specify the size of the internal Virtual Machine stack, in bytes.
prints the Virtual Machine version, currently 1.0.0.
prints the help reported in Figure.

4.2 CIL Stack

The Cil Virtual Machine maintains an internal program stack, which layout is represented in Figure 6. Upon each function call, caller will push actual parameters onto the stack, in reverse order, as from C-calling conventions. On top of the stack will stay the return value of the function: local variable allocation can make the stack growth further.

On function return, it is at the caller's reponsibility to restore the stack pointer to its correct position in virtual memory, eventually popping function result, as needed, and assigning the stack frame pointer the original value it had before the call took place.

**Figure:** The CIL stack layout
$\begin{figure}\centering \footnotesize \begin{verbatim}---------------- \vert... ... ... \vert \vert \vert ----------------\end{verbatim} \normalsize\end{figure}$

The intermediate assembly language provide for two instructions, directly managing the stack: PUSH and POP. An reserved integer Virtual Machine register, RG[stack], is always allocated to contain the current pointer to the first free byte in the stack. These instructions are actually only mnemonic shortcuts for STORE and LOAD, respectively, being the only memory-register class instructions. The different operative codes for stack-related instructions allow for fewer parameters, in instruction format (address of the stack is fixed and implicitly obtained by the value of the RG[stack] register).

Moreover, the interpreted program can directly modify the stack pointer, with usual register manipulation instructions, therefore reserving or releasing byte blocks of stack memory with simple ADD or SUB instructions. The Virtual Machine will treat the interpreted program's stack a a vector of byte, being the value of RG[stack] simply the pointer of &stack[0]. An initial default stack size is reserved, upon program interpretation startup.

On the top of the stack, it is logically positioned the heap allocation area, managed through conventional call to malloc and free runtime support functions. The interpreted program and its Virtual Machine actually share their virtual address space: therefore, the heap of the interpreted program is shared with the one of the Virtual Machine itself.

5 A RISC Debugger

Coming soon...

6 Compiling and Distributing the Project

The project does not come with a standard configure utility. Instead, a Makefile is provided, in the project root directory. It contains all the information pieces required to correctly build and install the solution.

The provided Makefile allows to build two different configurations of the project: release and debug. Debugging configuration has been useful, during feature implementation phase, and maybe it is of any interest for final users, any more.

The release option, instead, is the configuration which enables all solution code optimizations and produces highly lightweight and efficient executables for the suite tools. Compiling in release configuration may take a while, due to the heavy employment of inlining features in solution codes, and to the compiling options, which include several expensive assembly-level optimizations.

To compile the suite, just type the following command at the command prompt: make release install. The generation of this project documentation is achieved through make doc, instead.

After the compiling phase successful termination, in the bin/ directory under the project root will appear three links to the suite tool binaries: CCompiler for the Compiler, CilVM for the CIL Virtual Machine and CilDbg for the CIL assembly-level Debugger.

6.1 Project dependencies

Project build requires several libraries and tools to be already installed on the target machine, in order to correctly link the final executables. These software pieces are:

flex and bison, common GNU tools for lexical and syntactic analysis, respectively. They are extremely wide-used, and maybe already present in most Linux OS distributions;
libfl, for command-line argument parsing. For this, same considerations hold.
L^ATEXand dvipdf, the powerful text processing suite, required to compile this project manual and documentation.

Make sure to have all the tools and libraries required correctly installed, before compiling the suite.

6.2 Using the Code

CCompilerSuite is distributed under the GNU General Public Licence (GPL). Modifying, distributing and copying of this software or part of it are subject to the GPL terms and conditions of use.

This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published
by the Free Software Foundation; either version 2 of the License,
or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty
of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

About this document ...

CCompilerSuite Project Manual

This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.71)

The command line arguments were:
latex2html CCompilerSuiteDoc.tex -split 2 -nonavigation -notop_navigation -nobottom_navigation -noauto_navigation -noindex_in_navigation -antialias_text -transparent -white -show_section_numbers -show_init

The translation was initiated by Luca Veraldi on 2006-04-16

Luca Veraldi 2006-04-16