1 Write code

C code uses curly braces, semi-colons, and type names before variable names during declarations

#define three (1+2)

int printf(const char *format, ...);

int main() {
    // this is a comment
    double x;
    int y = three;
    /*
    this is also a comment
    */
    x = y>>1;
    printf("%g\n", x);
    return 0;
}

2 Preprocess

The C pre-processor removes comments, handles any lines that begin with #, and may change some other parts.

You can invoke the C pre-processor with the command cpp. It will add a lot of lines beginning with # which are only to help subsequent error messages make sense and we can ignore.

int printf(const char *format, ...);

int main() {
    double x;
    int y = (1+2);
    x = y>>1;
    printf("%g\n", x);
    return 0;
}

The preprocessing can generate errors, such as for unmatched /* and */ or invalid # lines.

3 Lex

A lexer (also called a tokenizer) identifies the tokens in the file. For example, it knows that int is an identifier-type token, ( is a left parenthesis-type token, >> is a binary operator-type token, and so on.

Lexing disambiguates some things, such as deciding that >> is one shift token, not two greater-than tokens and that 1.2 is a single token but x.y is three tokens.

Lexing also removes most non-semantic of style like spacing.

int printf ( const char * format , ... ) ; int main ( ) { double x ; int y = ( 1 + 2 ) ; x = y >> 1 ; printf ( "%g\n" , x ) ; return 0 ; }

Lexing can generate errors if you type some symbol that’s not part of C

Lexers are generally built as an extension of a finite automata, a topic you’ll learn about in CS 3120.

4 Parse

A parser combines the tokens into a tree structure called an abstract syntax tree or AST.

There are several ways we could make an AST, but one of them is

function declaration
- int
- printf
- arguments
  1. declaration
    - type
      - const
      - char
      - *
    - name
      - format
  2. ...
function definition
- int
- main
- arguments
- body
  1. definition
    - double
    - x
  2. initialization
    - int
    - y
    - +
      - 1
      - 2
  3. assignment
    - x
    - >>
      - y
      - 1
  4. invocation
    - printf
    - arguments
      1. "%g\n"
      2. x
  5. return
    - 0

Note that not all tokens make it into the AST; some, like parentheses and semi-colons, help organize the tree but can then be abstracted away (hence the word abstract in abstract syntax tree).

Parsers can generate errors if you write something that doesn’t look like C code.

Parsers are related to, but generally not actually implemented as, pushdown automata, a topic you’ll learn about in CS 3120. The actual design of parsers is discussed in CS 4610 and CS 4620.

5 Type Checking

A type checker annotates the AST with the type of each expression and variable, adding implicit type-casts where needed.

function declaration
- int
- printf
- arguments
  1. declaration
    - type
      - const
      - char
      - *
    - name
      - format
  2. ...
function definition
- int
- main
- arguments
- body
  1. definition
    - double
    - x (double)
  2. initialization
    - int
    - y (int)
    - + (int)
      - 1 (int)
      - 2 (int)
  3. assignment
    - x (double)
    - (cast to double)
      - >> (int)
        
        y (int)
        
        1 (int)
  4. invocation
    - (int) printf (const char *, …)
    - arguments
      1. "%g\n" (const char *)
      2. x (double)
  5. return
    - 0 (int)

Type checking can generate errors if the types of values and operators do not match, if the declared type does not match usage, or if you failed to declare something.

Type checking in C is done in a single pass, top to bottom, which means that declarations must precede usage.

This is valid C

int f();
int g() {
    return f();
}

But this is not valid C; in particular, it will fail during type-checking with an undefined identifier error on the first use of f.

int g() {
    return f();
}
int f();

After type checking, the declarations can all be removed from the AST as they’ve been copied to where they were used.

6 Code Generation and Optimization

Code generation is the process of turning an annotated AST into assembly. Typically, this is done with several intermediate steps, varying by compiler, but usually looking something like

Turn AST into an assembly-like language with infinite registers
Apply various optimizations, such as
- removing code that doesn’t have a lasting impact on the program
- replacing variables that have a constant value with literals
- re-ordering code to reduce the number of jumps
- using the same register for different values
- ISA-specific tricks like replacing 5*x with x + x*4 which can be implemented with a lea instruction
Allocate registers and memory locations
- a simple (non-optimal) way places all variables in memory
- optimizations try to find ways to have some variables only in registers
Assemble the results

Code generation never generates errors.

The level of optimization can be controlled with various flags to the compiler. The most common are

-O0 meaning no optimization at all: every variable is written to memory and every line of C code has one or more lines of machine code
-O1 meaning basic optimization: use registers where possible, but every line of C code has one or more lines of machine code
-O2 meaning normal optimizations: change code in ways that always make it faster
-O3 meaning aggressive optimizations: change code in ways that usually makes it faster, but might also make it bigger or slower in certain circumstances
-Os meaning optimize for program size, not program speed

Each of those -O flags enables a group of optimizations which can be individually enabled or disabled with -f... flags like -finline-functions (which will replace some calls with the entire code of the called function) and -fno-inline-functions (which disables -finline-functions).

Code generation, together with all previous steps of compilation, are performed by running clang -c myfile.c (or gcc -c myfile.c). To stop before assembling, use -S instead of -c.

The assembly generated by clang -c -Os on the file used above is

50                      push   %rax
f2 0f 10 05 00 00 00    movsd  0x0(%rip),%xmm0
00 
bf 00 00 00 00          mov    $0x0,%edi
b0 01                   mov    $0x1,%al
e8 00 00 00 00          callq  15
31 c0                   xor    %eax,%eax
59                      pop    %rcx
c3                      retq

Note the optimization: the code now just says %al is 1 which is what the previous math ((1+2)>>1) evaluates to; the floating-point equivalent of that is placed in a memory location to be loaded by the movsd command.

Three of those instructions (movsd, the first mov, and call) have placeholder addresses initialized to all 00 bytes which will be changed during linking.

The output of code generation is not the finished assembly or machine code we’ll run. Rather, it is something called a relocatable object file that contains

The machine code, with placeholders for addresses so that multiple files can be merged without address collision
Constants and global variables
The names of functions and variables that other programs can use
If compiled with -g, a rough mapping between source code and the resulting machine code to aid in debugging

7 Linking

Most code depends on code written by others. There are two ways to connect code together; one is called static linking or often simply linking.

Static linking combines several relocatable object files into a single executable. It places each object file’s assembly into memory, updates all jump and call targets to those new locations, and links up uses from one file with definitions in other files.

Static linking can generate errors if two of the object files both define the same name.

Dynamic linking is like a deferred static linking. It has the same general goals: to connect multiple files. But it picks some of those files and says don’t bundle it with the binary I’m creating now; instead, load it from a separate file each time I run the program. Dynamic linking is valuable in saving disk space and reducing memory usage; it can support fixing a bug in a library for all programs without recompiling them; but it cal also lead to errors where two programs expect different versions of the same library.

If you run clang or gcc without the -c or -S flags, it will perform linking. You can also run the linker directly as ld, though we’ll never have you do that.

The assembly generated by clang -Os on the file used above is about 150 assembly instructions including a staticly-linked printf that invokes a dynamically-linked printf@GLIBC_2.2.5 where GLIBC_2.2.5 is the name of a library file to be loaded at runtime. It also changes addresses in main to resolve linkage, giving

50                      push   %rax
f2 0f 10 05 d7 0e 00    movsd  0xed7(%rip),%xmm0
00 
bf 10 20 40 00          mov    $0x402010,%edi
b0 01                   mov    $0x1,%al
e8 f3 fe ff ff          callq  401030
31 c0                   xor    %eax,%eax
59                      pop    %rcx
c3                      retq

Note that xmm0 now has the address of the floating-point literal, edi now has the address of the string literal, and callq now goes to the address of the statically-linked printf function.

8 Loading

Loading takes a program in a file and puts it into memory so it can run. It resolves dynamic links and may move some memory around to reduce the chance that certain security vulnerabilities will work.

Loading is performed by the operating system, and most OSes provide several ways to do it, such as

clicking on an icon in your operating system
typing ./myprogram in a terminal window
if the directory containing myprogram is in the path, typing myprogram in a terminal window
using the execve library function or its relatives