Compilation Example
Write code
C code uses curly braces, semi-colons, and type names before variable names during declarations
Example:
##define three (1+2)
int printf(const char *format, ...);
int main() {
// this is a comment
double x;
int y = three;
/*
this is also a comment
*/
x = y>>1;
printf("%g\n", x);
return 0;
}
Preprocess
The C pre-processor removes comments, handles any lines that begin with #
, and may change some other parts.
You can invoke the C pre-processor with the command cpp
{.sh}. It will add a lot of lines beginning with #
which are only to help subsequent error messages make sense and we can ignore.
Example:
int printf(const char *format, ...);
int main() {
double x;
int y = (1+2);
x = y>>1;
printf("%g\n", x);
return 0;
}
The preprocessing can generate errors, such as for unmatched /*
and */
or invalid #
lines.
Lex
A lexer (also called a “tokenizer”) identifies the “tokens” in the file. For example, it knows that int
is an “identifier”-type token, (
is a “left parenthesis”-type token, >>
is a “binary operator”-type token, and so on.
Lexing disambiguates some things, such as deciding that >>
is one shift token, not two greater-than tokens and that 1.2
is a single token but x.y
is three tokens.
Lexing also removes most non-semantic of style like spacing.
Example: int
printf
(
const
char
*
format
,
...
)
;
int
main
(
)
{
double
x
;
int
y
=
(
1
+
2
)
;
x
=
y
>>
1
;
printf
(
"%g\n"
,
x
)
;
return
0
;
}
Lexing can generate errors if you type some symbol that’s not part of C
Lexers are generally built as an extension of a finite automata, a topic you’ll learn about in CS 3120.
Parse
A parser combines the tokens into a tree structure called an “abstract syntax tree” or AST.
Example: There are several ways we could make an AST, but one of them is
- function declaration
int
printf
- arguments
- declaration
- type
const
char
*
- name
format
- type
...
- declaration
- function definition
int
main
- arguments
- body
- definition
double
x
- initialization
int
y
+
1
2
- assignment
x
>>
y
1
- invocation
printf
- arguments
"%g\n"
x
- return
0
- definition
Note that not all tokens make it into the AST; some, like parentheses and semi-colons, help organize the tree but can then be abstracted away (hence the word “abstract” in “abstract syntax tree”).
Parsers can generate errors if you write something that doesn’t look like C code.
Parsers are related to, but generally not actually implemented as, pushdown automata, a topic you’ll learn about in CS 3120. The actual design of parsers is discussed in CS 4610 and CS 4620.
Type Checking
A type checker annotates the AST with the type of each expression and variable, adding implicit type-casts where needed.
Example:
- function declaration
int
printf
- arguments
- declaration
- type
const
char
*
- name
format
- type
...
- declaration
- function definition
int
main
- arguments
- body
- definition
double
x
(double)
- initialization
int
y
(int)+
(int)1
(int)2
(int)
- assignment
x
(double)- (cast to double)
>>
(int)y
(int)1
(int)
- invocation
- (int)
printf
(const char *, …) - arguments
"%g\n"
(const char *)x
(double)
- (int)
- return
0
(int)
- definition
Type checking can generate errors if the types of values and operators do not match, if the declared type does not match usage, or if you failed to declare something.
Type checking in C is done in a single pass, top to bottom, which means that declarations must precede usage.
Example: This is valid C
int f();
int g() {
return f();
}
But this is not valid C; in particular, it will fail during type-checking with an “undefined identifier” error on the first use of f
.
int g() {
return f();
}
int f();
After type checking, the declarations can all be removed from the AST as they’ve been copied to where they were used.
Code Generation and Optimization
Code generation is the process of turning an annotated AST into assembly. Typically, this is done with several intermediate steps, varying by compiler, but usually looking something like
- Turn AST into an assembly-like language with infinite registers
- Apply various optimizations, such as
- removing code that doesn’t have a lasting impact on the program
- replacing variables that have a constant value with literals
- re-ordering code to reduce the number of jumps
- using the same “register” for different values
- ISA-specific tricks like replacing
5*x
withx + x*4
which can be implemented with alea
instruction
- Allocate registers and memory locations
- a simple (non-optimal) way places all variables in memory
- optimizations try to find ways to have some variables only in registers
- Assemble the results
Code generation never generates errors.
The level of optimization can be controlled with various flags to the compiler. The most common are
-O0
meaning no optimization at all: every variable is written to memory and every line of C code has one or more lines of machine code-O1
meaning basic optimization: use registers where possible, but every line of C code has one or more lines of machine code-O2
meaning normal optimizations: change code in ways that always make it faster-O3
meaning aggressive optimizations: change code in ways that usually makes it faster, but might also make it bigger or slower in certain circumstances-Os
meaning optimize for program size, not program speed
Each of those -O
flags enables a group of optimizations which can be individually enabled or disabled with -f...
flags like -finline-functions
(which will replace some call
s with the entire code of the called function) and -fno-inline-functions
(which disables -finline-functions
).
Code generation, together with all previous steps of compilation, are performed by running clang -c myfile.c
(or gcc -c myfile.c
). To stop before assembling, use -S
instead of -c
.
Example: The assembly generated by clang -c -Os
on the file used above is
50 push %rax
f2 0f 10 05 00 00 00 movsd 0x0(%rip),%xmm0
00
bf 00 00 00 00 mov $0x0,%edi
b0 01 mov $0x1,%al
e8 00 00 00 00 callq 15
31 c0 xor %eax,%eax
59 pop %rcx
c3 retq
Note the optimization: the code now just says %al
is 1
which is what the previous math ((1+2)>>1
) evaluates to; the floating-point equivalent of that is placed in a memory location to be loaded by the movsd
command.
Three of those instructions (movsd
, the first mov
, and call
) have placeholder addresses initialized to all 00 bytes which will be changed during linking.
The output of code generation is not the finished assembly or machine code we’ll run. Rather, it is something called a “relocatable object file” that contains
- The machine code, with placeholders for addresses so that multiple files can be merged without address collision
- Constants and global variables
- The names of functions and variables that other programs can use
- If compiled with
-g
, a rough mapping between source code and the resulting machine code to aid in debugging
Linking
Most code depends on code written by others. There are two ways to connect code together; one is called “static linking” or often simply “linking”.
Static linking combines several relocatable object files into a single executable. It places each object file’s assembly into memory, updates all jump and call targets to those new locations, and links up uses from one file with definitions in other files.
Static linking can generate errors if two of the object files both define the same name.
Dynamic linking is like a deferred static linking. It has the same general goals: to connect multiple files. But it picks some of those files and says “don’t bundle it with the binary I’m creating now; instead, load it from a separate file each time I run the program.” Dynamic linking is valuable in saving disk space and reducing memory usage; it can support fixing a bug in a library for all programs without recompiling them; but it cal also lead to errors where two programs expect different versions of the same library.
If you run clang
or gcc
without the -c
or -S
flags, it will perform linking. You can also run the linker directly as ld
, though we’ll never have you do that.
Example: The assembly generated by clang -Os
on the file used above is about 150 assembly instructions including a staticly-linked printf
that invokes a dynamically-linked printf@GLIBC_2.2.5
where GLIBC_2.2.5
is the name of a library file to be loaded at runtime. It also changes addresses in main to resolve linkage, giving
50 push %rax
f2 0f 10 05 d7 0e 00 movsd 0xed7(%rip),%xmm0
00
bf 10 20 40 00 mov $0x402010,%edi
b0 01 mov $0x1,%al
e8 f3 fe ff ff callq 401030
31 c0 xor %eax,%eax
59 pop %rcx
c3 retq
Note that xmm0
now has the address of the floating-point literal, edi
now has the address of the string literal, and callq
now goes to the address of the statically-linked printf
function.
Loading
Loading takes a program in a file and puts it into memory so it can run. It resolves dynamic links and may move some memory around to reduce the chance that certain security vulnerabilities will work.
Loading is performed by the operating system, and most OSes provide several ways to do it, such as
- clicking on an icon in your operating system
- typing
./myprogram
in a terminal window - if the directory containing
myprogram
is in “the path”, typingmyprogram
in a terminal window - using the
execve
library function or its relatives