C (guide and reference)
This is intended to be a practical guide (rather than an authoritative guide) to C, as implemented by clang and gcc for the x86-64 processor family.
Data Types
The sizeof(...) operator returns the size of a type in bytes. Thus, sizeof(int) is 4, not 32.
Primitive
Integer
The integer data types are
name | bits | representation | notes |
---|---|---|---|
_Bool | 1 or more | undefined | rarely used; for all types, 0 is false, anything else is true |
char | 8 | signedness undefined | usually used for characters, sometimes for bytes |
signed char | 8 | 2’s complement | |
unsigned char | 8 | unsigned integer | |
short | 16 | 2’s complement | |
int | 32 | 2’s complement | |
long | 32 or 64 | 2’s complement | 32 bits if compiled with -m32 , 64-bit if compiled with -m64 , compiler’s choice if neither |
long long | 64 | 2’s complement | |
Each has an unsigned
version (e.g., unsigned short
, etc). If unsigned
is used as a type by itself, it means unsigned int
.
Integer literals will be implicitly converted to the correct type upon assignment; thus char x = -3 will turn -3 into an 8-bit value automatically, and int x = 'x' will turn 'x' into a 32-bit value. This only works up to int-sized literals.
To force a literal to be long add a l
or L
to the end; to force it to be unsigned add a u
or U
. This is generally only needed for very large constants, like unsigned long very_big = 9223372036854775808uL
.
Character literals are integer literals written with a different syntax. There is no significant difference between '0'
and 48
other than legibility.
Floating-point
The floating-point datatypes are
name | exponent bits | fraction bits | total size | literal syntax |
---|---|---|---|---|
float | 8 | 23 | 32 bits (4 bytes) | 3.1415f (f or F for float ) |
double | 11 | 52 | 64 bits (8 bytes) | 3.1415 (no suffix) |
long double | 15 | 64 | 80 bits (10 bytes) | 3.1415L (l or L for long ) |
Note that long double
has traditionally only differed from double
on x86 architectures.
Enumerations
The enum
keyword is a special way of defining named integer constants, in ascending order unless otherwise specified.
enum { a, b, c, d=100, e };
/* a is 0, b is 1, c is 2, d is 100, and e is 101 */
int f = e; /* equivalent to f = 101 */
void and casting
There is also a special void
type that means either “a byte with no known meaning” (if used as part of a pointer type) or “nothing at all” (if used as a return type or parameter list).
Casting between integer types truncates (if going smaller) or zero- or sign-extends (if going larger, depending on the signedness of the value) to fit the available space. Casting to or from floating-point types converts to a nearby representable value (which may be infinity), with the exception that casting from a floating-point type to an integer type truncates the fractional part (rounding toward zero) instead of rounding to the nearest integer.
Pointers
For every type, there is a type for a pointer to a value of that type. These are written with a *
after the type:
int *x; /* points to an int */
char *s; /* points to a char */
float **w; /* points to a pointer that points to a float */
float ***a; /* points to a pointer that points to a pointer that points to a float */
A pointer to any value stored in memory can be taken by using the address-of operator &
. Thus &x
is the address of the value stored in x
, but &3
is an error because 3 is a literal and does not have an address. You also can’t take the address of the result of an expression: &(x + y)
or &&x
are both errors as well.
You de-reference pointers with the same syntax used to declare them: a *
before the variable.
int *x = &z; /* x = pointer to z */
int x1 = *x; /* x1 == z */
You can also de-reference pointers with subscript notation; *x
and x[0]
are entirely equivalent, as are *(x + n)
and x[n]
.
There is syntactic ambiguity when combining *
and [1]
. Is *a[1]
the same as *(a[1])
or (*a)[1]
? This is solved by operator precedence ([]
before *
), but is not intuitive to most programmers so you should always use parentheses in these cases.
All pointers are the same size (the size of an address in the underlying ISA) regardless of the size of what they are pointing to; thus sizeof(char *) == sizeof(long double *). Two special integer types [1] serve as “an integer the size of a pointer”: size_t is an unsigned integer of this size, and ssize_t is a signed integer of this size. With the compilers and ISAs we are using this semester, size_t is the same as unsigned long and ssize_t is the same as long.
When you add an integer to a pointer, the address stored in the pointer increases by a multiple of the sizeof
the pointed-to type.
int x = 10;
int *y = &x; // y points to x
int *z = y + 2; // z points 2 ints after x
long w = ((long)z) - ((long)y); // w is 8, not 2.
Compound Types
There are two basic compound types in C: the struct
and the array.
Array
An array is zero or more values of the same type stored contiguously in memory.
int array[1000]; /* an array of 1000 int values */
Except when used with sizeof
and &
, arrays act exactly like pointers to their first element; notably, this means that array[23]
does what you expect it to do: access the 24th element of the array.
The sizeof
an array is the total bytes used by all elements of the array:
unsigned x = sizeof(array); /* 4000: sizeof(int) * 1000 */
The & of an array yields the same address as the & of its first element (i.e., &array and &(array[0]) point to the same place, although the two pointers have different types).
Parentheses are allowed when declaring types, although their meaning is counter-intuitive to many students:
char *pc[10]; /* an array of 10 (char *)s */
char *(pc[10]); /* an array of 10 (char *)s */
char (*pc)[10]; /* a pointer to an array of 10 (char)s */
The rule here is that we declare variables exactly as we would use them: a pointer to an array would first be dereferenced ((*pc)
) and then indexed ((*pc)[i]
) to get a char
so we declare it as char (*pc)[10]
.
Array literals use curly braces and commas.
int x[10] = {1, 1, 2, 3, 5, 8, 13, 21, 34, 55};
Unless initialized with a literal like this, the contents of an array are undefined (i.e., may be whatever values the compiler finds most efficient to leave there) when created.
Arrays cannot be resized after being created.
Struct
A struct
also stores values contiguously in memory, but the values may have different types and are accessed by name, not index.
struct foo {
long a;
int b;
short c;
char d;
}; /* note the ; at the end; it is REQUIRED! */
The name of the resulting type includes the word struct
struct foo x;
unsigned long y = sizeof(struct foo);
x.b = 1234;
x.a = x.b - 5;
Compilers are free to lay out the data elements of a structure with padding between elements if they wish; this is often done in practice because memory tends to be faster when the address of a 4-byte value is a multiple of 4, so in the above example we expect y
to have a value larger than the minimal 15 bytes needed to store the fields of struct foo
.
Structures are passed by value; that is, using them as arguments, return types, or with = means that all of their fields are copied. This is inefficient for all but the smallest structs, so often pointers to structures are passed, not the structures themselves.
Because all pointers are the same size, you can have code use a pointer to a struct
without knowing what is inside the struct
; the fields only need to be known for sizeof
and the .
operator to work, not for parameter passing.
struct baz; /* just says "a struct of this name exists" */
void swizzle(struct baz *p); /* just says "a function of this name exists" */
/* Swizzles an array of struct bazs *
* This code does not need to understand what a struct baz is */
void swozzle(struct baz **x, int n) {
for(int i=0; i<n; i+=1) swizzle(x[i]);
}
Structure literals are written using curly braces and commas, optionally with .fieldname =
prefixes
struct a {
int b;
double c;
};
/* Both of the following initialize b to 0 and c to 1.0 */
struct a x = { 0, 1.0 };
struct a y = { .b = 0, .c = 1.0 };
Unless initialized with a literal like this, the values of fields of a struct are undefined (i.e., may be whatever values the compiler finds most efficient to leave there) when created.
typedef
You can give new names to any type by using the typedef
statement:
typedef int Integer;
Integer x = 23;
typedef double ** dpp;
double y0 = 12.34;
double *y1 = &y0;
dpp y = &y1;
struct foo { int x; double y; };
typedef struct foo foo;
foo z;
z.x = x;
z.y = **y;
typedef
type names are aliases to the old names; the compiler will treat both the original and new name as equivalent in all type checking.
Sometimes typedef
is used with anonymous struct
s:
typedef struct { int x; double y; } foo;
foo z;
z.x = 3;
Anonymous struct
s can also be used directly as type names, though that’s quite uncommon
struct { int x; double y; } z;
z.x = 3;
Union
A union is like a struct, except that all of the fields are stored in the same memory address. In practice, this means only one of them has a meaningful value at a time.
union odd {
long long i;
double d;
};
union odd x;
x.i = 0x1234; /* x's memory now contains 34 12 00 00 00 00 00 00 */
double y = x.d; /* y is now 2.30235e-320 (those same bytes) */
x.d = 0x1234; /* x's memory now contains 00 00 00 00 00 34 b2 40 */
long long z = x.i; /* z is now 0x40b2340000000000 (those same bytes) */
You can do bad things
C does not try to prevent you from doing bad things.
float x = 123.567f; /* A floating-point number */
int y = *((int *)&x); /* An integer made from the same bytes as the floating-point number */
int z[4]; /* An array of 4 integers */
int w = z[254]; /* An integer made from the contents of memory 1000 bytes after the end of z */
const char *s = "hi"; /* compiler makes the string in memory the OS won't allow us to change */
char *t = (char *)s; /* we get a pointer to that memory that C will allow us to change */
t[0] = 'H'; /* we try to change that memory (the OS will crash our program) */
C’s general attitude is “every rule has an exception” and “the programmer knows best”. It might make you do some complicated casting to do things, but it won’t stop you if you are determined.
Constant
If a type is preceded by const
, the compiler is free to perform optimizations that assume that no code will ever change the values of this type after they are first initialized.
As a special syntax, a string literal like "hello"
does two things:
- It ensures there exists somewhere an array of characters {'h', 'e', 'l', 'l', 'o', 0}, typically in read-only memory.
  - Note the 0 at the end (that’s byte-0 not character-0). This is how C knows the string is over.
- It returns a const char * pointing to the h.
Storage Classes
Variable declarations can optionally have one or more storage classes. The most important of these are:
const
Important enough to have its own section: Constant.
static inside a function
Declaring a local variable as static makes it a global variable that only that one function can see.
Example:
#include <stdio.h>
int prev(int y) {
    static int q = 0; // only initialized once
    int ans = q;
    q = y;
    return ans;
}
int main() {
    printf("%d\n", prev(3));
    printf("%d\n", prev(1));
    // q = 5; // <- error, only prev can access q
    printf("%d\n", prev(4));
    printf("%d\n", prev(1));
}
prints
0
3
1
4
If you need a function to have persistent state, a static local is the preferred way to do that.
static for a global
Declaring a global variable as static makes it visible only inside that .c file.
If you need a few related functions to share some persistent state, a static global is the preferred way to do that. If just one function needs the state, use a static local instead.
extern for a global
Declaring a global variable as extern tells the compiler “assume it exists, don’t create it. It will be supplied by a different .c file. Connect the two during linking.”
In general, globals are declared as extern in .h files and re-declared without extern in exactly one .c file.
If you need many functions to all access the same persistent state, use a non-static global declared as extern everywhere it is used except in one .c file. Non-static globals are associated with various bugs, so you should prefer a static global or static local instead where possible.
register, volatile, and restrict
These are hints to the compiler. They have no direct impact on code operation, but can make some optimizations work better.
register suggests that this variable be stored in a register, not in memory, and is ignored by many compilers.
volatile tells the optimizer to assume the variable is being changed by something other than the code itself and blocks certain kinds of optimizations that might otherwise assume differently.
restrict can only be used with pointers, and tells the compiler it is safe to assume that no other pointer describes the same memory as this pointer, enabling certain optimizations that might otherwise not be permitted.
Control constructs
Braces and scope
Any statement may be replaced with a sequence of statements inside braces. Variables declared inside a set of braces vanish at the end of those braces.
int x;
{
int y;
x = y; /* OK, both x and y in scope */
}
y = x; /* ERROR: y is no longer in scope */
Flow of control
Nice and common ones
if
Any statement may be preceded by if ( ... )
; the statement will only be executed if the expression inside the parentheses yields a non-zero value.
Any statement following a statement preceded by if ( ... )
may be preceded by else
; the statement will only be executed if the expression inside the if
’s parentheses yields a zero value.
while
Any statement may be preceded by while ( ... )
; the statement will only be executed if the expression inside the parentheses yields a non-zero value, and will continue to be executed until that condition stops being true.
for
The special construct for (e1; e2; e3) s;
is equivalent to the following:
{
e1;
while (e2) {
s;
e3;
}
}
with a slight twist: if s
contains a continue
, it jumps to e3
instead of to while (e2)
.
If e2
is omitted, it is assumed to be 1
, so for(;;) s;
repeats s
forever.
Ugly and uncommon ones
do-while
The syntax do s; while (e);
means the same as s; while (e) s;
: that is, it always does s
once before first checking e
. In my experience, this is used for less than 1% of loops.
label and goto
Any line of code may be preceded by a label, which is an identifier followed by a colon (e.g. some_label:).
The goto some_label;
statement unconditionally jumps to the code identified by that label.
In 1968 Edsger Dijkstra wrote an article “Go To Statement Considered Harmful”. Since then, the use of goto in code has dropped significantly; it’s now usually a sign either of over-emphasis on optimization or a shim to avoid having to redesign poorly-organized code. However, there are a few situations where it can be handy, so it does sometimes show up in high-quality code.
switch
The switch
statement in C may be implemented in several ways by the compiler, but it is designed to be a good match for the “jump table” approach.
The syntax of the switch
is as follows:
switch(i) {
case 0:
statements;
break;
case 1:
statements;
break;
case 3:
statements;
break;
case 4:
statements;
break;
default:
statements;
}
Conceptually, this is
- a block of code
- with multiple labels
- where the labels are numbered, not named
and it operates like the (invalid) code
c_code *targets[5] = { (case 0), (case 1), (default), (case 3), (case 4) };
if (0 <= i && i < 5) goto targets[i];
else goto default;
The break
(as with a break
in a loop) stops running the code block and goes to the first statement after it.
Many people think of a switch as being a nice way to write a long if/else if sequence, and are then annoyed by its limitations and quirks: it has to have an integer selector (as this is really an index), and it “falls through” to the next case if there is no break. Hence the following example, taken from Wikipedia:
switch (age) {
case 1: printf("You're one.\n"); break;
case 2: printf("You're two.\n"); break;
case 4: printf("You're four.\n");
case 5: printf("You're four or five.\n"); break;
default: printf("You're not 1, 2, 4 or 5!\n");
}
Because many programmers make mistakes with switch, reactions to it vary: some style guides ban it outright; others require a special style when using it; some later languages reuse the syntax for constructs a jump table cannot handle; and some mostly C-compatible languages add rules like “each case must either end with break or with an explicit fallthrough/goto case”.
Most compilers have several different implementations of switch
they can pick between; they might use a jump table, a sequence of if
/else if
s, a binary search, etc.
Example: We can change a switch into gotos as follows
switch (age) {
case 1: printf("You're one.\n"); break;
case 2: printf("You're two.\n"); break;
case 4: printf("You're four.\n");
case 5: printf("You're four or five.\n"); break;
default: printf("You're not 1, 2, 4 or 5!\n");
}
void *locations[] = {&&L1, &&L2, &&L5, &&L3, &&L4}; /* age 3 has no case, so it goes to default (L5) */
if (age < 1 || age > 5) goto L5;
else goto *locations[age-1];
if (0) {
L1: printf("You're one.\n");
goto L6;
L2: printf("You're two.\n");
goto L6;
L3: printf("You're four.\n");
L4: printf("You're four or five.\n");
goto L6;
L5: printf("You're not 1, 2, 4 or 5!\n");
L6: ;
}
Note that the &&label_name syntax is a GCC and Clang extension and not part of the official C language.
Functions
Most common use
The most common use of functions in C looks much like you are used to from other languages: a return type, a name, a list of typed parameters in parentheses, and a body in braces.
void baz(int i, char *b, float c) {
b[i] = (char)c;
return;
}
It is also common to declare functions before defining them, in part because C requires functions to be declared before use.
int is_even(unsigned n);
int is_odd(unsigned n);
int is_even(unsigned n) {
if (n == 0) return 1;
else return is_odd(n - 1);
}
int is_odd(unsigned n) {
if (n == 0) return 0;
else return is_even(n - 1);
}
Often the declarations or function headers are put in a separate file, called a header file and traditionally named with the suffix .h
. The #include
directive can thus grab all of these at once, simplifying coding without increasing the size of the resulting .c
file or the compiled binary.
Syntax variations
C allows several variations to standard function syntax. Most of these are considered bad programming style.
- Function return types can be omitted, defaulting to int:
min(int a, int b) { return a < b ? a : b; }
- Function parameter types can be omitted, defaulting to int:
min(a, b) { return a < b ? a : b; }
- Function parameter types can be specified between the ) and the {:
void baz(i, b, c) int i; char *b; float c; { b[i] = (char)c; return; }
Technically, this does something called “promotion” and has various quirks; for this and other reasons it is often called “old-style” and generally discouraged.
- A zero-argument function can be written as either
int three() { return 3; }
or
int three(void) { return 3; }
- The main function (only) will return 0 if it is missing a return, and may omit its arguments upon definition.
It’s all convention
C passes arguments using a calling convention. This is obeyed blindly by both the caller and the callee; so if the caller thought the callee had different argument types than it did, neither will notice they have a problem; they’ll just silently do the wrong thing.
Example: Consider the following pair of files:
baz.c
long bar(char *);
/******* adds bar("hello") to its argument *******/
long baz(long x) {
return bar("hello") + x;
}
bar.c
/** returns the requested suffix of "ten letter" */
char *bar(long x) {
char *c = "ten letter";
return c + (x%10);
}
When executed,
- baz will put the address of the first character of "hello" into the %rdi register and then callq bar.
- bar will look in %rdi for an integer, modulo it by 10, and use it to put an address of a character in the string "ten letter" into %rax.
- baz will look in %rax for an integer, add x to it, and return.
This is almost certainly not what was wanted, but no part of it violates the rules.
Variadic functions
The number of arguments in a function is known as the function’s arity. Many functions have fixed arity, requiring the same number of arguments each time they are invoked, but sometimes it is nice to have a function that has variable arity, or a variadic function.
In C, when invoking a function of variable arity the invoking code simply follows the calling convention, putting some arguments in registers and others on the stack. The invoked function then needs to know how many arguments it received. Since it can’t tell anything without consulting at least one argument, all variadic functions in C require at least one argument, and almost all use that argument to decide how many (and what type) the other arguments are.
By far the most famous variadic function in C is printf
, which is defined as
int printf(const char *format, ...);
Note the trailing ...
means “this is a variadic function.” Thus, printf
may be invoked with any arguments you want, as long as the first is a const char *
(that is, a string):
printf("%s, %s %d, %.2d:%.2d\n", weekday, month, day, hour, min);
The printf
function uses fairly involved rules about %
s in its first argument to determine how many and what type the other arguments should be.
Writing a variadic function is somewhat complicated by the fact that the extra arguments do not have names. C provides (declared in stdarg.h
) a special data type va_list
and a set of special macros to use in accessing variadic arguments.
void va_start(va_list ap, argN);
type va_arg(va_list ap, type);
void va_end(va_list ap);
To use these, you might do something like
#include <stdarg.h>
/* counts sign changes in a list of ints that ends with a 0 argument */
int sign_swaps(int num0, ...) {
va_list ap;
int last = num0;
int ans = 0;
va_start(ap, num0);
while(last != 0) {
int next = va_arg(ap, int);
if ((last < 0) != (next < 0)) ans += 1;
last = next;
}
va_end(ap);
return ans;
}
If you want to write variadic functions, you should
- Read all of man stdarg.h
- Read all of man stdarg.h again, because you almost certainly missed something important
- Look up variadic security vulnerabilities like the format string attack
- Write good tests, including too-few- and too-many- and wrong-type-argument invocations.
Preprocessor
Before compilers compile code, they run the C Preprocessor. This does several uninteresting tasks like removing comments, but also processes various macros and directives.
#include <somefile.h>
Looks for somefile.h
in the include path, a set of directories typically including /usr/include
and sometimes a few others.
Upon finding the file, it dumps its entire contents into this part of the file, as if you had copy-pasted it here.
#include "somefile.h"
Looks for somefile.h
in the current source directory and, if not found there, in the include path.
Upon finding the file, it dumps its entire contents into this part of the file, as if you had copy-pasted it here.
#if expression
, #else
, #elif expression
, and #endif
Upon encountering an #if
, the preprocessor evaluates the truth of the expression, which must contain only literals (and operators) because the preprocessor is not running code. If it is false, all code from that #if
to the matching #else
, #elif
, or #endif
is removed from the source code as if you had deleted it.
#else
and #elif expression
behave like else
and else if (expression)
would in C.
#define NAME anything at all
Defines an object-like macro. Anywhere NAME
appears in the source code, this tells the preprocessor to replace it with anything at all
—literally those exact tokens, as if you had done a global find-and-replace in your source file.
#define NAME(a,b,c) anything including a and b and c
Defines a function-like macro. Anywhere NAME(x,y,z)
appears in the source code, this tells the preprocessor to replace it with anything including x and y and z
—that is, it does a find-and-replace with some parameterization.
This is a lexical replacement, not a syntactic one, so you should almost always add parentheses around each argument and around the full expression:
#define TIMES2(x) x * 2 /* bad practice */
#define TIMES2b(x) ((x) * 2) /* good practice */
int x = ! TIMES2(2 + 3); /* int x = ! 2 + 3 * 2; (i.e., !2 + 6 == 6) */
int y = ! TIMES2b(2 + 3); /* int y = ! ((2 + 3) * 2); (i.e., !10 == 0) */
If you decide to become a C expert, there is more to know about macros; see https://en.wikipedia.org/wiki/C_preprocessor#Special_macros_and_directives for a reasonable overview.
#ifdef NAME
, #ifndef NAME
These act like #if
, except instead of checking if something is true they check if a name has been #defined
(#ifdef
) or not (#ifndef
).
A very common use of these macros is to ensure only one copy of an .h
file is included. For example, my_file.h
might look like
#ifndef __MY_FILE_HAS_BEEN_INCLUDED__
#define __MY_FILE_HAS_BEEN_INCLUDED__
/* file contents here */
#endif
This way, if a file includes my_file.h twice (for example, because it includes two other .h files that each #include my_file.h), then the first copy will define __MY_FILE_HAS_BEEN_INCLUDED__ and the second copy, seeing that __MY_FILE_HAS_BEEN_INCLUDED__ is already defined, will have all its contents removed by the #ifndef.
Example: If we have something like
foo.h
#ifndef __FOO_H
#define __FOO_H
int foo;
#endif
foo.c
int x;
#include "foo.h"
int y;
#include "foo.h"
int z;
the #include
processing will create
int x;
#ifndef __FOO_H
#define __FOO_H
int foo;
#endif
int y;
#ifndef __FOO_H
#define __FOO_H
int foo;
#endif
int z;
The first #ifndef is true, since __FOO_H had not been defined before that line:
int x;
#define __FOO_H
int foo;
int y;
#ifndef __FOO_H
#define __FOO_H
int foo;
#endif
int z;
That means that the second #ifndef
is false, since the first defined __FOO_H
int x;
#define __FOO_H
int foo;
int y;
int z;
__FILE__
and __LINE__
The preprocessor is guaranteed to define __FILE__
as an object-like macro expanding to the name of the current file, in quotes, like "my_file.c"
. The preprocessor is also guaranteed to define __LINE__
as an object-like macro expanding to the line number on which __LINE__
appears, like 23
.
These are often used in debugging messages, as e.g. printf("Error in %s on line %d\n", __FILE__, __LINE__);
.
Because the preprocessor redefines these on its own on each new line of code, they have a special #line
directive to change them if you need to do that (not the usual #define
). The #include
processing and comment removing adds such #line
directives so that it does not change source line numbers.
#error "error message"
Shows error message
as an error message during compilation.
Most C compilers add several other compiler-specific preprocessor directives, like #warning
, #pragma message
, #pragma once
, #include_next
, #import
, etc. Each is added to simplify some common task, but also makes code harder to port to other platforms.
[1] Defined using typedef: size_t in <stddef.h> and ssize_t in <sys/types.h>.