Changelog:
- 22 Jan 2023: correct description of Save-Handle-Restore to accurately reflect that the whole state of the processor is generally not saved and that the exception handler is often responsible for saving some of it
- 22 Jan 2023: add aside on nested exceptions and disabling interrupts
- 27 Jan 2023: correct description of what Ctrl+Z does (it sends TSTP not STOP)
- 27 Jan 2023: add description of blocking signals and signal safety
- 27 Jan 2023: expand comment about sa_mask; remove that mention of signal()
- 28 Jan 2023: add section on memory-mapped files and lazy allocation to applications of virtual memory
1 Kernel Mode vs. User Mode
All multi-purpose chips today have (at least) two modes in which they can operate: user mode and kernel mode1. Thus far you have probably only written user-mode software, and most of you will never write kernel-mode software during your entire software development careers.
1.1 Definitions
When running in kernel mode, the hardware allows the software to do all operations the hardware supports. In user mode, instead, some operations are not allowed.
The code that runs in kernel mode is called the kernel.
If the code running in user mode wants to perform one of the prohibited operations, it must instead ask the kernel to do so on its behalf.
1.2 Motivation
Running software in user mode only restricts what that software can do; it does not provide access to any additional functionality. But this restriction provides several advantages.
1.2.1 Limiting Bugs’ Potential for Evil
One reason to not run in kernel mode is to limit the impact a software bug can have. Inside the computer is all of the code and data that handles drawing the screen and tracking the mouse and reacting to keys and so on; I’d rather not be able to make a mistake when writing my Hello World program and accidentally mess up those core functionalities. By running in user mode, if my code tries to touch any of that it will be blocked from doing so: my program will crash, but the computer will keep running.
1.2.2 Each Program Thinks it is in Control
Physical computers have limited physical resources. One of the benefits of running in user mode is that the particular limitations are hidden from you by the kernel and hardware working together. For example, even if you only have 8GB of memory you can still use addresses larger than that limit: the kernel and hardware will work together to provide the illusion that there was enough memory even if there was not.
This illusion is even more important when you are running more than one process simultaneously. Perhaps you have a terminal, an editor, and your developed code all running at once. The kernel and hardware will work together to ensure that your code cannot even see, let alone modify, the memory being used by the terminal or editor. Each program is given the illusion of being the only one running on an unlimited machine.
1.2.3 Wrapper around Hardware
Each piece of hardware attached to a chip has its own protocols and nuances of operation. Fortunately, only the kernel needs to know about those details. If the mouse tries to talk to your program directly, the kernel intercepts the communication and handles it for you so you don’t notice it happened unless you asked the kernel to inform you about it. If you try to talk to the disk, the kernel likewise moderates that communication, keeping you from needing to know the specific protocol the currently-attached disk expects.
1.2.4 Multiple Users
Because the kernel is an unavoidable intermediary for talking to hardware, it can choose to forbid some interactions. One common use of this is to allow multiple user accounts. Since the kernel is the one that controls access to the disk, for example, it can choose to prohibit my processes from accessing the part of the disk it has assigned to you and not me.
1.3 Implementation
1.3.1 Protected Instructions and Memory
When in user mode, there are a set of instructions that cannot be executed and segments of memory that cannot be accessed. Examples of things you cannot directly access in user mode include
- the part of memory that stores information about which user account you are logged into
- instructions that send signals out of the chip to communicate with disks, networks, etc.
- instructions that let user-mode code execute in kernel mode
1.3.2 Mode Bit
The simplest way to create two modes is to have a single-bit register somewhere on the chip that indicates if the chip is currently executing in user mode or in kernel mode. Each protected instruction then checks the contents of that register and, if it indicates user mode, causes an exception instead of executing the instruction. A special protected instruction is added to change the contents of this mode register, meaning kernel-mode code can become user-mode, but not vice versa.
Modern processors often have more than one operating mode with more than one level of privilege, which (among other benefits) can help make virtualization efficient; the details of these additional modes are beyond the scope of this course.
1.3.3 Mode-switch via Exceptions
The core mechanic for switching to the kernel is a hardware exception. An exception results in several functionally-simultaneous changes to processor state, notably including both (a) changing to kernel mode and (b) jumping to code in kernel-only memory. Thus, the only mechanisms that exist for entering kernel mode will run kernel code for the next instruction, preventing user code from running itself in kernel mode.
The nature of these hardware exceptions and the jump to kernel code that is associated with them is a large enough topic to deserve its own section.
2 Exceptions
Hardware exceptions are examples of exceptional control flow: that is, the sequence of instructions executed around an exception is an exception to the usual rules of sequentiality and jumps. They also tend to appear in exceptional (that is, unusual) situations.
2.1 Kinds
There are several kinds of hardware exceptions.
Confusingly, terms for exceptions and particular kinds of exceptions vary across sources. Here, we care that you know that hardware exceptions have many uses rather than knowing specific terms.
In the convention we have followed, we use exception as the generic term and use the terms interrupt or trap for particular kinds of exceptions. Some sources use interrupt or trap as the generic term instead and use exception to refer to a more specific kind (probably roughly equivalent to what we call faults). There is also some other variation in the definitions of the more specific terms. So, when reading a processor manual or similar reference, be careful to look for context to see exactly how it defines each of the terms below.
2.1.1 Interrupts
An interrupt occurs independently from the code being executed when it occurs. It might result from the mouse moving, a network packet arriving, or the like.
Interrupts are typically handled by the kernel in a way that is invisible to the user code. The user code is frozen, the interrupt is handled, and then the user code is resumed as if nothing had happened.
2.1.2 Faults
A fault is the result of an instruction failing to succeed in its execution. Examples include dividing by zero, dereferencing a null pointer, or attempting to execute a protected instruction while in user mode.
There are two basic families of responses to faults: either the kernel fixes the problem and reruns the user code that generated the fault, or it cannot fix the problem and kills the process that caused the fault. In practice, fixable faults happen quite often, notably the page fault discussed later. Even faults the kernel cannot fix on its own are often handled by asking the user code if it knows how to fix them through mechanisms called signals. However, many programs do not choose to handle the signals, so the end result is often still termination of the fault-generating process.
2.1.3 Traps
A trap is an exception caused by a special instruction whose sole job is to generate exceptions. Traps are the main mechanism for intentionally switching from user mode to kernel mode and are the core of all system calls. System calls are typically function-call like interfaces provided by kernels to allow programs to access protected functionality. System calls are used to interact with the file system, spawn and wait on threads, listen for key presses, and anything else that you cannot do with simple code alone.
The system call interface. We can view the combination of the limited user-mode hardware interface and system calls as collectively defining the interface user-mode code sees.
2.2 Handling
When an exception occurs, the processor switches to kernel mode and
jumps to a special function in kernel code called an exception
handler.
Because interrupts exist, this can happen without any
particular instruction causing the jump, and thus might happen at any
point during code execution. In order for the handler to later resume
user code, the exception-handling process must also save the processor
state before executing the handler.
2.2.1 Save-Handle-Restore
The basic mechanism for any exception to be handled is
- Save (some of) the current state of the processor (program counter, kernel mode bit, etc.)
- Enter kernel mode
- Jump to code designed to react to the exception in question, called an exception handler
- When the handler finishes, enter user mode and restore processor state (program counter, kernel mode bit, etc.)
These steps (except for the actual execution of the exception handler) are all done atomically by the hardware.
Processors vary in how much of the processor state they save and restore in hardware. The PC has to be saved and restored since jumping to the exception handler overwrites the old PC value. The processor may also save and restore the values of all or some of the other registers. If the processor itself does not, then the operating system’s exception handler code can do this instead.
2.2.2 Which Handler?
In general, there can be as many different exception handlers as there are exception causes. To select which one to run, the hardware consults an exception table: an array of code addresses, indexed by the kind of exception generated (e.g., divide-by-zero uses index 0, invalid instructions use index 6, etc.). The index is called an exception number or vector number.
Switches
Having an array of code addresses is not unique to exception
handlers; it is also present in C in the form of a switch
statement.
For example, the following C code
switch(x) {
    case 0: f0();
    case 1: f1();
    case 3: f3();
    case 2: f2();
        break;
    case 4:
    case 5: f5();
    case 6: f6();
        break;
    case 8: f8();
}
compiles down to the following assembly code
    cmpl    $8, %eax
    ja      AfterSwitch
    jmpq    *Table(,%rax,8)
Case0:
    callq   f0
Case1:
    callq   f1
Case3:
    callq   f3
Case2:
    callq   f2
    jmp     AfterSwitch
Case5:
    callq   f5
Case6:
    callq   f6
    jmp     AfterSwitch
Case8:
    callq   f8
AfterSwitch:
supported by the following jump table
    .section .rodata
Table:
    .quad Case0
    .quad Case1
    .quad Case2
    .quad Case3
    .quad Case5
    .quad Case5
    .quad Case6
    .quad AfterSwitch
    .quad Case8
Exception tables are just one particular use of this array-of-code-addresses idea.
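To make the analogy concrete, here is a small user-mode sketch of dispatching through an array of code addresses the way hardware dispatches through an exception table; the handler names and table size are invented for illustration:

#include <stdio.h>

void handle_divide_by_zero(void) { puts("handling divide by zero"); }
void handle_invalid_instruction(void) { puts("handling invalid instruction"); }
void handle_other(void) { puts("handling some other exception"); }

/* an array of code addresses, indexed by exception number */
void (*exception_table[8])(void) = {
    handle_divide_by_zero,      /* index 0 */
    handle_other, handle_other, handle_other,
    handle_other, handle_other,
    handle_invalid_instruction, /* index 6 */
    handle_other,
};

int main() {
    int vector = 6;            /* pretend the hardware computed this number */
    exception_table[vector](); /* dispatch, like jmpq *Table(,%rax,8) */
}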
2.2.3 After the Handler
Handlers may either abort the user code or resume its operation. Aborting is primarily used when there is no obvious way to recover from the cause of the exception.
When resuming, the handler can either re-run the instruction that was
running when it was generated or continue with the next instruction
instead. A trap, for example, is similar to a callq
in its
behavior and thus resumes with the subsequent instruction. A fault
handler, on the other hand, is supposed to remove the barrier to success
that caused the fault and thus generally re-runs the faulting
instruction.
Aborts
We classified exceptions by cause, but some texts classify them by result instead. If classified by what happens after the handler, there is a fourth class: aborts, which never return.
- Fault: re-runs the triggering instruction
- Trap: runs the instruction after the triggering instruction
- Interrupt: runs each instruction once (has no triggering instruction)
- Abort: stops running instructions
The abort result may be triggered by any cause: if a memory access detects memory inconsistency we have an aborting fault, an exit system call is an aborting trap, and integral peripherals like RAM can send aborting interrupts if they notice unrecoverable problems.
Nested exceptions and disabling interrupts
It is possible for an exception to occur while another exception is being handled. For example, the processor may receive a keypress while a system call is being handled. It is, however, very difficult to write exception handlers that will behave correctly when interrupted in this way. So, processors usually provide a way to disable interrupts, temporarily preventing exceptions from being handled. Usually, interrupts are automatically disabled while an exception handler runs, and an operating system will not re-enable them until it has made sure that its data structures are in a state where a new exception can safely be handled.
2.2.4 Example: Linux system calls
In Linux, almost all communications from user- to kernel-code are
handled by a trap with exception number 128. Instead of using the
exception number to select the particular action the user is requesting
of the kernel, that action is put in a number in %rax
; this
decision allows Linux to have space for billions of different system
calls instead of just the 200 or so that would fit in an exception
number and are not reserved by the hardware.
To see a list of Linux system calls, see man 2 syscalls. Most of these are wrapped in their own library function with the same name; man 2 syscall documents the generic mechanism (and a generic wrapper) for invoking any of them by number.
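A minimal sketch using that generic wrapper with the real SYS_write constant from <sys/syscall.h>; the wrapper puts the number in %rax and executes the trap for you:

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

int main() {
    // equivalent in effect to write(1, ...): SYS_write goes in %rax, then a trap
    syscall(SYS_write, 1, "hello via syscall\n", 18);
}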
Consider the C library function socket
. The assembly for
this function (as compiled into libc.so
) begins
socket:
    endbr64
    mov    $0x29,%eax
    syscall
Let’s walk through this a bit at a time:
- endbr64: This is part of the control-flow enforcement elements of x86-64. It has no outwardly visible impact on program functionality, but it does add some security enforcement, making it harder for some classes of malicious code to access this function. Omitting some (important but tedious to describe) details, because this instruction is present in the function you can't jump into the code except at that line. That means malicious code can't jump a bit later into this function with the goal of executing the syscall instruction without first going through the %rax-setting instruction.
- mov $0x29,%eax: Places 29₁₆ (41₁₀) into %rax. Per /usr/include/sys/syscall.h, 41 is the number of the socket syscall.
- syscall: A trap, generating exception number 128. This means the hardware saves the processor state, then jumps to the address stored in the kernel's exception_handler[128]. The exception handler at index 128 checks that %rax contains a valid system call number (which 41 is), after which it jumps to the kernel's system_call_handler[41], the address of the socket call implementation. The kernel's system calling convention has the same first three arguments as C's calling convention, so the handler has access to the arguments from the int socket(int domain, int type, int protocol) invocation. It uses them to do whatever work is needed to create a socket, placing its file descriptor in %rax to be a return value. The handler ends by executing a protected return-from-handler instruction that
  - restores the processor state (but leaves %rax alone as a return value),
  - returns to user mode, and
  - jumps back to user code.
After that is some error-checking code, and then the function
returns. The whole function is only 11 instructions (24 bytes) long. The
code in system_call_handler[41]
of the kernel is
considerably longer; many thousands of lines of C code, in fact (see https://github.com/torvalds/linux/tree/master/net).
System libraries and system call wrappers.
Very little application code directly makes system calls. Sometimes
applications use functions like socket above, which we can
above, which we can
call system call wrappers. Since C does not support making
system calls directly, these functions serve to allow applications
written in languages like C to make system calls indirectly by calling
the simple wrapper function. The system provides libraries that, among
other things, include these wrapper functions.
Often the system call interface is too low-level for typical
applications to use, so the system libraries provide a higher-level
interface that’s implemented using system calls. For example, on Linux
(and all other multiuser OSes I am aware of), malloc
and
printf
are implemented by library code that runs in user
mode that makes simpler system calls.
2.3 Exception-Like Constructs
2.3.1 Signals
One view of exceptions is that they enable communication from user code to the kernel. Signals permit the opposite direction of communication.
| | → User code | → Kernel code | → Hardware |
|---|---|---|---|
| User code → | ordinary code | Trap | via kernel |
| Kernel code → | Signal | ordinary code | protected instructions |
| Hardware → | via kernel | Interrupt | — |
Signals are roughly the kernel-to-user equivalent of an interrupt. At any time, while executing any line of code, a signal may arrive. The kernel suspends the currently executing code and checks whether you've told it about a signal handler, a special function you believe will handle that signal. After running the handler, the interrupted code resumes.
Linux defines many different signals, listed in man 7 signal
.
Each of them has a default action if no handler is registered, most
commonly terminating the process.
Typing Ctrl+C in the command line causes the SIGINT signal to be generated. If we want Ctrl+C to do something else, we have to handle that signal:
#include <stdio.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static void handler(int signum) {
    // send string to stdout (1) without <stdio.h>
    // not using printf() here in case the signal handler runs
    // while printf() is running
    write(1, "Provide end-of-file to end program.\n",
          strlen("Provide end-of-file to end program.\n"));
}

int main() {
    struct sigaction sa;      // how we tell the OS what we want to handle
    sa.sa_handler = handler;  // the function pointer to invoke
    sigemptyset(&sa.sa_mask); // do not "block" additional signals while signal handler is running
    sa.sa_flags = SA_RESTART; // restart functions that can recognize they were interrupted

    if (sigaction(SIGINT, &sa, NULL) == -1) {
        fprintf(stderr, "unable to override ^C signal");
        return 1;
    }

    char buf[1024];
    while (scanf("%1023s", buf) && !feof(stdin)) {
        printf("token: %s\n", buf);
    }
}
Almost all signals can be overridden, but for many it is not wise to do so. For example, this code:
#include <stdio.h>
#include <signal.h>

static void handler(int signum) {
    printf("Ignoring segfault.\n");
}

int main() {
    struct sigaction sa;      // how we tell the OS what we want to handle
    sa.sa_handler = handler;  // the function pointer to invoke
    sigemptyset(&sa.sa_mask); // do not "block" additional signals while signal handler is running
    sa.sa_flags = SA_RESTART; // restart functions that can recognize they were interrupted

    if (sigaction(SIGSEGV, &sa, NULL) == -1) {
        fprintf(stderr, "unable to override segfault signal");
        return 1;
    }

    // let's make a segfault -- enters infinite handler loop
    char * buf;
    buf[1234567] = 89;
}
… will print Ignoring segfault.
repeatedly until you kill it
with something like Ctrl+C. The last line creates a segfault, which
causes the OS to run the handler. The handler returns normally, so the
OS assumes the cause of the fault was fixed and re-runs the triggering
code, which generates another segfault.
Obviously, you should not do something like this in real code.
There is a system call named kill
that asks the kernel to send
a signal to a different process. While this can be used for
inter-process communication, better systems (like sockets and pipes)
exist for that if it is to be a major part of an application’s
design.
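For completeness, a minimal sketch of the kill wrapper in use, with the target process ID taken from the command line:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    if (argc < 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }
    pid_t pid = (pid_t)atoi(argv[1]);
    // politely ask the other process to shut down; it may handle this signal
    if (kill(pid, SIGTERM) == -1) { perror("kill"); return 1; }
}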
Command line and signals
From a bash command line, you can send signals to running processes.
- Ctrl+C sends SIGINT, the "interrupt" signal that means "I want you to stop doing what you are doing."
- Ctrl+Z sends SIGTSTP, a signal whose default action is to suspend the process, freezing it in place until you ask for it to resume.
- bg causes a suspended process to resume running, but in the background so you can use the terminal for other purposes.
- fg causes a suspended or background process to resume running in the foreground, re-attaching the keyboard to stdin.
- kill <pid> sends SIGTERM, the "terminate" signal, to the process with process ID <pid>. The "terminate" signal means "I want you to shut down now" but can be handled to save unsaved work or the like.
- kill -9 <pid> sends SIGKILL, the "kill" signal, to the process with process ID <pid>. The "kill" signal cannot be handled; it always causes the OS to terminate the program.
- Any other signal can be provided in similar fashion, either by number (SIGKILL is signal number 9) or name (e.g., kill -KILL <pid>).
There are other tools for sending signals, but the above are sufficient for most common use cases.
2.3.1.1 Signal safety
When a signal handler runs in the middle of another function, this can severely limit what operations the signal handler can use. For example, if a signal handler runs in the middle of a call to malloc() (because of when you pressed Ctrl+C), then you probably cannot call malloc() from that signal handler: if you did, malloc's internal data structures would likely be corrupted.
To let programmers avoid this issue, the Unix-like operating system standard POSIX documents a list of functions that are async-signal-safe. Linux reproduces this list in man 7 signal-safety. These functions are guaranteed to behave correctly even if a signal handler calls them while they are already running. write is on this list, which is why I used it in the signal handler in the example above. Another way to avoid this problem is to temporarily block signal handlers from running in the middle of unsafe operations, or to explicitly tell the OS when to handle them, as described in the next section.
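A third common approach is to do almost nothing in the handler itself: set a flag of type volatile sig_atomic_t (which POSIX guarantees is safe to write from a handler) and do the real work back in the main loop. A minimal sketch:

#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t got_sigint = 0;

static void handler(int signum) {
    got_sigint = 1; // async-signal-safe: only records that the signal happened
}

int main() {
    struct sigaction sa;
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGINT, &sa, NULL);

    for (;;) {
        if (got_sigint) {
            got_sigint = 0;
            printf("saw Ctrl+C; the real work happens here, outside the handler\n");
        }
        /* ... the program's normal work goes here ... */
    }
}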
2.3.1.2 Blocking signals and handling signals without signal handlers
Sometimes we may not want a signal to be handled right away. For example, a program with a handler for Ctrl+C that saves before exiting may not want the handler to run while it's loading a new file. To do this, you can use functions like sigprocmask to block signals, temporarily preventing their signal handlers from running:
sigset_t sigint_as_set;
sigemptyset(&sigint_as_set);
sigaddset(&sigint_as_set, SIGINT);
sigprocmask(SIG_BLOCK, &sigint_as_set, NULL);
... /* do stuff without signal handler running yet */
sigprocmask(SIG_UNBLOCK, &sigint_as_set, NULL);
If a signal is sent to a process while it is blocked, then the OS will track that it is pending. When the pending signal is unblocked, its signal handler will be run. Processors typically provide an analogous mechanism for operating systems to disable interrupts.
Unix-like systems also provide functions that work with blocked signals to allow processing signals without having a signal handler interrupt normal execution. For example, sigsuspend will temporarily unblock a blocked signal just long enough to run its signal handler. Alternately, sigwait will wait until one of a specific set of signals is sent and "handle" it by returning instead of running any signal handler.
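A minimal sketch of the sigwait approach: the program blocks SIGINT, then asks the OS to deliver it as a return value rather than through a handler:

#include <signal.h>
#include <stdio.h>

int main() {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGINT);
    sigprocmask(SIG_BLOCK, &set, NULL); // block SIGINT so it becomes pending instead

    int sig;
    sigwait(&set, &sig); // sleeps until SIGINT is sent; no handler runs
    printf("received signal %d; exiting cleanly\n", sig);
}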
2.3.2 setjmp, longjmp, and software exceptions
Portions of the save- and restore-state functionality used by exception handlers are exposed by the C library functions setjmp and longjmp.
setjmp
, given a jmp_buf
as an argument,
fills that structure with information needed to resume execution at that
location and then returns 0. Thereafter, longjmp
can be
called with that same jmp_buf
as an argument;
longjmp
never returns, instead returning
from
setjmp
for a second time. longjmp
also
provides an alternative return value for setjmp
.
The following program
#include <setjmp.h>
#include <stdio.h>

jmp_buf env;
int n = 0;

void b() {
    printf("b1: n = %d\n", n);
    n += 1;
    printf("b2: n = %d\n", n);
    longjmp(env, 1);
    printf("b3: n = %d\n", n);
    n += 1;
    printf("b4: n = %d\n", n);
}

void a() {
    printf("a1: n = %d\n", n);
    n += 1;
    printf("a2: n = %d\n", n);
    b();
    printf("a3: n = %d\n", n);
    n += 1;
    printf("a4: n = %d\n", n);
}

int main() {
    printf("main1: n = %d\n", n);
    n += 1;
    printf("main2: n = %d\n", n);
    int got = setjmp(env);
    if (got) {
        printf("main3: n = %d\n", n);
        n += 1;
        printf("main4: n = %d\n", n);
    } else {
        printf("main5: n = %d\n", n);
        n += 1;
        printf("main6: n = %d\n", n);
        a();
    }
    printf("end of main\n");
}
prints
main1: n = 0
main2: n = 1
main5: n = 1
main6: n = 2
a1: n = 2
a2: n = 3
b1: n = 3
b2: n = 4
main3: n = 4
main4: n = 5
end of main
See also a step-by-step simulation of this
same process without the printf
s.
There was a time when setjmp
/longjmp
were
seen as effective ways of achieving
try
/catch
-like error recovery, which cannot be
handled with simple goto
because it may skip multiple
function call/returns.
| try/catch construct | setjmp/longjmp parallel |
|---|---|
| try { ... } | if (!setjmp(env)) { ... } |
| catch (...) { ... } | else { ... } |
| throw new Exception3(); | longjmp(env, 3); |
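As a minimal sketch of that parallel in C, with the longjmp value playing the role of the exception type:

#include <setjmp.h>
#include <stdio.h>

jmp_buf err_env;

void parse(int bad_input) {
    if (bad_input) longjmp(err_env, 3); // like "throw 3"
    puts("parsed fine");
}

int main() {
    int code = setjmp(err_env); // like "try": returns 0 the first time through
    if (code == 0) {
        parse(1); // may "throw" by longjmp-ing past this call
        puts("no error");
    } else {
        printf("caught error %d\n", code); // like "catch"
    }
}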
More efficient approaches to this have since been developed.
setjmp
records much of the state of the program in advance,
which gives it a significant cost in time; because try
is
generally assumed to be executed far more often than throw
,
we’d rather make throw
the expensive one, not
try
. Most of the information stored by setjmp
into the jmp_buf
can be found somewhere in the
setjmp
-invoking function’s stack frame, which many
languages maintain with sufficient discipline that it can be
unwound
to restore a previous state upon a throw
. C,
however, lets you do anything, including violating assumptions about the
stack organization, so C has an expensive setjmp
and less
expensive longjmp
instead of an expensive
throw
and less expensive try
.
3 Virtual Memory
One of the most important functions provided by a collaboration between the hardware and the operating system is virtual memory. Virtual memory provides
- each process with the illusion that it is the only process running.
- each process with the illusion that all possible addresses exist, no matter how much physical memory is available.
- memory protection such that user-mode code cannot access kernel memory.
- an efficient mechanism for communicating between processes.
3.1 Processes
A process is a software-only operating system construct that roughly parallels the end-user idea of a running program. Each process has its own virtual address space. The hardware knows the difference between user and kernel mode, but not the difference between a web browser and a text editor.
The operating system maintains a variety of data structures inside kernel-only memory to help track different processes. Portions of those structures relating to one process at a time2 are loaded into hardware-accessible registers and memory so that the hardware mechanisms will cause running code to interact with the appropriate process’s memory and other state.
Changing which process’s state is currently loaded is called a context switch. All major user operating systems (but not all embedded-system operating systems) use a hardware timer to create an exception every few dozen milliseconds to enable automated context switching and facilitate the illusion that more processes are running at a time than there are processors in the computer.
3.2 Regions of Memory
To provide process separation, each memory access needs to be checked to ensure it is permitted. Specific sets of permissions may vary by system, but a common set includes
- Read-only vs. read-write memory (e.g., is mov %rax, address allowed?)
- Kernel-only vs. user-mode memory (e.g., is mov address, %rax allowed if not in kernel mode?)
- Executable vs. data-only memory (e.g., is jmp address allowed?)
- Valid vs. invalid addresses (e.g., no code can use address 0)
Because most instructions would need to check at least one of these constraints, there is reason to encode them very efficiently. Because each process may access a very large number of memory addresses, they also need to be efficient in space, meaning that in practice, permissions are applied to large contiguous regions of memory.
3.2.1 Segments revisited
A common internal representation of memory regions used by operating systems is a list of segments, pairs of starting and stopping addresses and a set of permissions for the intermediate addresses.
Segments are never directly visible to the hardware3: instead, the operating system uses the segments to create hardware-visible page table entries and to react to hardware-generated, page-related faults and potentially convert them into signals to convey to the user process.
Because segments are a software-only construct, each operating system can store them in a different way. One simple example might be the following:
struct segment_node {
    void *low_address;
    void *high_address;
    unsigned int permissions_flags;
    struct segment_node *next;
};
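Given such a list, the check the OS performs when deciding how to respond to a fault might look like the following sketch; it fills in the role of the permitted_actions stub used in the page fault handler pseudocode later in this section:

#include <stdint.h>

// returns the permission flags of the segment containing the address,
// or 0 (no access allowed) if the address is in no segment
unsigned int permitted_actions(void *address, struct segment_node *list) {
    uintptr_t a = (uintptr_t)address;
    for (struct segment_node *s = list; s != NULL; s = s->next) {
        if ((uintptr_t)s->low_address <= a && a < (uintptr_t)s->high_address) {
            return s->permissions_flags;
        }
    }
    return 0;
}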
It is common to refer to the crashing of a program that tries to
access an invalid address as a segfault,
short for
segmentation fault.
This is technically a misnomer, as the fault
is generated by hardware, which does not even know about segments; the
generating fault is either a page fault or a protection fault. The
operating system may react to that fault by generating a segmentation
violation signal,
often abbreviated as SEGV or SIGSEGV.
Many people and programs using the term segfault
are aware
that it is technically incorrect; it is arguably an example of a
computing colloquialism.
POSIX-compatible operating systems provide the library functions
mmap
and munmap
for requesting the operating
system to adjust its segments. Typical user applications use
higher-level wrappers such as malloc
instead.
3.2.2 Pages
Hardware access to memory protections is designed to be very simple so that it can be very rapid. Each address is split into two parts: a page number made up of the high-order bits and a page offset made up of the low-order bits.
Each time memory is accessed in any way—including fetching the instruction itself, meaning at least once per instruction—the address is translated. Translation converts an address as it appears in code (called a virtual address) into an address that can be used to access the memory chip (called a physical address). This translation involves the following steps:
1. The virtual page number is used as a key or index into a data structure called a page table to retrieve a page table entry (see Page Tables). Conceptually this is what Java would call a Map<VirtualPageNumber, PageTableEntry>. It is stored in memory, using physical, not virtual, addresses, and its location is stored in a register that is changed on each context switch so that the operating system can maintain a different page table for each process.
2. The page table entry contains a variety of protection or flag bits. In addition to the various read/write/execute/kernel permission bits, it may also contain an "unmapped" flag indicating that this virtual page has no corresponding physical page.
3. If the access cannot proceed for any reason, a fault is generated (a page fault if unmapped, otherwise a protection fault). The fault handler should either modify the page table so that the address will work on a second attempt or kill the program.
4. If the necessary permissions are present, the page table entry contains a physical page number. This physical page number, concatenated with the page offset from the virtual address, is the physical address.
Hardware uses various optimizations to speed up this process, notably using the Translation Lookaside Buffer.
3.2.3 Protection… but side channels?
Address translation is designed so that the hardware guarantees that
- Only kernel-mode code can change the page table.
- All memory accesses are guarded by the page table.
Combined with careful coding of the operating system, it is intended that these provide full separation of processes, so that no two programs can see the contents of the other’s memory unless they have both asked the operating system to enable that.
Since 2017, it has become known that these protections are not quite as complete as they were once thought to be. Although these protections do ensure that one process cannot directly access another’s memory, they do not guarantee the absence of side-effects of memory access that another process can detect. For example, many hardware accelerations are based on the principle that what happens once is likely to happen again, so timing how long a memory access takes can leak some information about what other memory accesses other processes may have made.
This is one example of how a side channel might exist, and why it is challenging to reason about their absence. Securing primary channels is easy to think about, even if not always easy to do, because the data in question is known. That the timing of individual memory accesses was information that needed protection was not even known for decades after it became a potential side channel.
3.3 Page Tables
A page table is conceptually just a map between virtual page numbers (which are fixed-size integers) and page table entries (which are each a physical page number and a set of Boolean flags).
3.3.1 Single-level page tables
In the early days of virtual memory, when the entire addressable space was not large, a single array could suffice for implementing an efficient page table. A special kernel-access-only register, called the page table base register or PTBR, stores the address of this array.
Consider a system with 16-bit virtual addresses. If that memory was broken into 2⁷ pages of 2⁹ bytes each, and if each page table entry consists of 6 flag bits and a 10-bit physical page number, then the single-level page table is an array containing 2⁷ 2-byte integers (i.e., 2⁸ bytes in total).
If the virtual address is 1010 1111 0000 0101₂, then the page offset is 1 0000 0101₂, and the virtual page number is 101 0111₂. Thus, the hardware accesses page table entry 101 0111₂—that is, physical address PTBR + 1010 1110₂—to find a page table entry. If that page table entry's flags allow access and it has the physical page number, say, 11 0011 0011₂, then the resulting physical address is the physical page number (11 0011 0011₂) concatenated with the page offset (1 0000 0101₂), for a physical address of 110 0110 0111 0000 0101₂.
Note that this example maps 16-bit virtual addresses to 19-bit physical addresses: each process is limited to 64KB of addressable memory but the hardware can support 512KB of physical RAM.
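The arithmetic of this worked example can be expressed as a short C simulation of the hardware's wiring; flag checking is elided, and the layout assumed above (low 10 bits of each PTE hold the physical page number) is baked in:

#include <stdint.h>

#define OFFSET_BITS 9
#define PPN_MASK 0x3FF // low 10 bits of a PTE: the physical page number

uint16_t page_table[1 << 7]; // the single-level table the PTBR points to

uint32_t translate(uint16_t va) {
    uint16_t vpn    = va >> OFFSET_BITS;             // high 7 bits
    uint16_t offset = va & ((1 << OFFSET_BITS) - 1); // low 9 bits
    uint16_t pte    = page_table[vpn];               // 6 flag bits + 10-bit PPN
    // real hardware would check the flag bits here and possibly fault
    uint32_t ppn = pte & PPN_MASK;
    return (ppn << OFFSET_BITS) | offset;            // a 19-bit physical address
}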
Single-level page tables are inefficient for 32-bit addresses and basically untenable for 64-bit addresses.
3.3.2 Multi-level page tables
Multi-level page tables are a particular instance of a linked data structure known as a high-arity (or wide node) fixed-depth tree, which is an efficient way of storing large, sparse arrays, particularly when entries are clustered. Since this is not a data structure that is generally taught in data structure courses, this section provides a full overview of the data structure itself before discussing how it is used for virtual memory.
3.3.2.1 Fixed-depth high-arity trees
A tree is a linked data structure where each node contains an array of pointers to additional nodes. The best-known tree types use arity-2 (also known as binary or width-2) trees, where the arrays are all 2 elements long, but trees with higher arity are common when hardware is involved.
Many trees have variable depth: the number of steps from the root to the node containing the desired data may be different for different data. That design allows the tree to scale in size, but also means that conditional checks are needed at each node. Alternatively, a tree can be designed with a fixed depth, with all data stored the same number of steps from the root. Fixed-depth trees typically store all data in the leaves of the tree, with a separate data design for internal nodes and leaf nodes.
The following code defines a tree with arity 16 and three levels of intermediate nodes. It has a fixed height of (depending on how you count) 3 or 4.
#define WIDTH 16

struct leaf {
    PAYLOAD data[WIDTH];
};
struct height1 {
    struct leaf *kids[WIDTH];
};
struct height2 {
    struct height1 *kids[WIDTH];
};
struct height3 {
    struct height2 *kids[WIDTH];
};

/* ... */
struct height3 *root = /* ... */;
This tree has the ability to store 16⁴ = 2¹⁶ = 65,536 PAYLOAD values using a total of 16³ + 16² + 16¹ + 1 = 4,369 pointers, including the root pointer.
Fixed-depth trees with power-of-two width in each node can be accessed with integer indexes, like an array, by using bit masks to turn indexes into tree navigation. The bits of the index are split up, using the high-order bits for the first tree node index and lower-order bits for subsequent nodes until a leaf is reached.
In the following figure, which uses width 8 instead of 16 to avoid over-cluttering the figure, the value 23 has index 100 111 011 100₂.
Continuing the previous code example, each tree can be treated as if it were an array of 65,536 entries:
PAYLOAD arr(struct height3 *root, unsigned short index) {
    struct height2 *child1 = root  ->kids[(index>>12)&(WIDTH-1)];
    struct height1 *child2 = child1->kids[(index>> 8)&(WIDTH-1)];
    struct leaf    *child3 = child2->kids[(index>> 4)&(WIDTH-1)];
    return child3->data[index&(WIDTH-1)];
}
Such a structure is significantly less efficient than a simple array: it requires more memory than a simple array because of the additional pointers; it requires more time because it requires multiple memory accesses per element access. However, it can save significant space if the majority of the indexes have no value, as entire subtrees can be omitted.
By adding intermediate null checks, most of the tree can be omitted if only a few indices are needed:
PAYLOAD *arr(struct height3 *root, unsigned short index) {
    if (!root) return NULL;
    struct height2 *child1 = root  ->kids[(index>>12)&(WIDTH-1)];
    if (!child1) return NULL;
    struct height1 *child2 = child1->kids[(index>> 8)&(WIDTH-1)];
    if (!child2) return NULL;
    struct leaf    *child3 = child2->kids[(index>> 4)&(WIDTH-1)];
    if (!child3) return NULL;
    return &child3->data[index&(WIDTH-1)];
}
While this code is even less time-efficient than the previous code,
it is significantly more space-efficient if only a few values are in the
array.
For example, if only indices 0–10 (0x0000–0x000A) and 60,000–60,050 (0xEA60–0xEA92) are used, the only non-NULL pointers are:
- In the root, 0x0 and 0xE
  - in child1 0x0: 0x0
    - in child2 0x0: 0x0
      - in child3 0x0: 0x0 through 0xA
  - in child1 0xE: 0xA
    - in child2 0xA: 0x6 through 0x9
      - in child3 0x6: 0x0 through 0xF
      - in child3 0x7: 0x0 through 0xF
      - in child3 0x8: 0x0 through 0xF
      - in child3 0x9: 0x0 through 0x2
Thus, only ten intermediate nodes and 5 × 16 = 80
PAYLOAD
values are allocated rather than the 65,536
PAYLOAD
values that would have been allocated for a simple
array.
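The space savings above depend on allocating subtrees only when they are first used. Here is a sketch of a store function that creates missing nodes on demand (using calloc so new child pointers start out NULL), much as an OS creates page table nodes lazily; it assumes root itself is already allocated:

#include <stdlib.h>

void arr_set(struct height3 *root, unsigned short index, PAYLOAD value) {
    struct height2 **p1 = &root->kids[(index>>12)&(WIDTH-1)];
    if (!*p1) *p1 = calloc(1, sizeof(struct height2)); // allocate on first use
    struct height1 **p2 = &(*p1)->kids[(index>>8)&(WIDTH-1)];
    if (!*p2) *p2 = calloc(1, sizeof(struct height1));
    struct leaf **p3 = &(*p2)->kids[(index>>4)&(WIDTH-1)];
    if (!*p3) *p3 = calloc(1, sizeof(struct leaf));
    (*p3)->data[index&(WIDTH-1)] = value;
}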
3.3.2.2 Multi-level page table implementation
Most programs access only a few megabytes of memory, allocated in a few contiguous chunks, out of a 64-bit address space with 2⁶⁴ = 18,446,744,073,709,551,616 possible addresses. This is, in many ways, an ideal scenario for using fixed-depth high-arity trees to store a very sparse array of page table entries.
x86-64-compatible processors handle 64-bit addresses as follows:
They completely ignore the top 16 bits of each virtual address. 2⁴⁸ = 281,474,976,710,656 is seen as more than large enough an address space for the current generation of software applications.
They use 64-bit page table entries, with larger physical page numbers than virtual page numbers. This does not mean that all physical page numbers are used: it is the OS’s job, when creating new page table entries, to ensure that the physical page number picked corresponds to an actual address in available RAM.
They pick a page size such that every page table node exactly fits in one page.
The most common such page size is 4KB with 2⁹ PTEs per node, using a four-level page table (9+9+9+9+12 = 48). Other page sizes also work: for example, 256KB pages with 2¹⁵ PTEs per node, using a two-level page table (15+15+18 = 48), fit perfectly, and at a slight loss of space efficiency other sizes can be used as well.
Page size selection is a trade-off. The smaller the pages, the fewer bytes of memory are wasted because of partially-used pages, but the more time and space is devoted to the page table.
At one extreme, 16-byte pages (2 PTEs per node) mean almost no unused allocated memory but also mean 44 intermediate page table nodes separate the PTBR and the PTE.
At the other extreme, 2²⁶-byte pages (a single-level page table filling half a page with 2²² PTEs) mean that the smallest possible program needs at least two pages (128MB of memory) to run: one for the page table and one for the code. A more realistic minimal program with at least 1 page of code, 1 page of read-only globals, 1 page of read/write globals, 1 page of heap, 1 page of stack, 1 page of shared library functions, and 1 page of kernel memory, plus the page table itself, requires 512MB.
4KB pages are generally seen as a nice intermediate point between tiny and huge pages, though how well they compare with other sizes in practice depends in large part on the application domain.
The trade-off between big and small pages becomes more significant as the address space increases. Trying to mitigate this trade-off is part of the reason why x86-64 only actually uses 48 bits of its 64-bit addresses.
x86-64 also allows mixed-size pages, where most pages are 4KB but a
few are 2MB. These big pages
are only used for data, not page
table nodes, meaning some paths down the page table tree pass through
one fewer node (9+9+9+21 = 48). This adds some complexity to the page
table traversal process, but allows the operating system more control
over how pages are distributed.
Each address translation that the TLB does not optimize runs a process similar to the following:
size_t vpn[4] = split_bits(va>>12, 9);
size_t ppn = PTBR;
for (int i = 0; i < 4; i += 1) {
    PTE pte = ram[(ppn<<12) | (vpn[i]<<3)];
    if (pte.unmapped) page fault;
    if (pte.flags & current_usage) protection fault;
    ppn = pte.ppn;
}
pa = (ppn<<12) | (va & ((1<<12)-1));
All of the above is performed by hardware, having been compiled down to physical transistors. It is not programmable.
The programmable parts are the fault handlers. A page fault handler (written in the OS software) might do something like
void handle_page_fault(size_t va, int access_type) {
    int flags = permitted_actions(va, segment_list);
    if ((access_type & flags) != access_type)
        send_signal(SIGSEGV);
    ssize_t ppn = get_unused_physical_page();
    if (ppn < 0) ppn = swap_page();
    create_page_table_entry(va, ppn, flags);
}
The above pseudocode uses several function stubs:
- permitted_actions: Check the OS list of segments to see what kinds of access (reading, writing, executing, etc.) the segment containing this address allows. If this address is not in a segment, returns that no actions are allowed.
- get_unused_physical_page: The OS maintains a list of all of the physical pages in RAM and which process is using each. The idea here is to pull one page from the "unused" list onto the "used by current process" list.
- swap_page: In the event that get_unused_physical_page fails to find an unused page, the OS picks a page currently being used by some process and "swaps it out": that is, it
  - picks a physical page
  - writes its contents onto a region of disk known as the "swap space"
  - changes the PTE for that page such that the unmapped bit is set and the rest of the bits tell where the contents are located on disk
  - changes that page to be owned by the current process
  - returns the newly-freed physical page number
  See Page Swapping for more on swapping.
- create_page_table_entry: Ensures that the page table entry is in the page table (possibly creating intermediate nodes in the tree) and sets the appropriate permissions and physical page number in the PTE. If the PTE already exists and notes where in the swap space the old contents of this page are located, loads the swapped-out contents back from disk into memory as well.
An important thing to note about the above: software (the operating system) writes the page table in a format that the hardware understands and can read.
The following depicts all of the lookups used by x86-64’s virtual memory system with 4K pages. Virtual addresses are in yellow, physical addresses in cyan, and page table entries in white.
The three zero bits at the end of the first four physical addresses
accomplish the index × 8
needed to compute the address of an
8-byte PTE by its index.
Of the above
- The user/compiler puts data in memory and picks the virtual address
- The OS puts the data into the PTBR and the PTEs into RAM
- The hardware does everything else depicted
3.3.3 Translation Lookaside Buffer
Address translation needs to happen for every instruction, at least once to fetch the instruction and a second time if the instruction contains a memory access, so making it fast is vital to keeping the computer efficient.
Modern processors use a special cache called the translation lookaside buffer (or TLB) to keep the most-used address translations in high-speed memory.
Conceptually, the TLB is used as follows:
size_t vpn_full = va>>12;
size_t ppn;
PTE pte;
if (tlb_has(vpn_full)) {
    pte = tlb_lookup(vpn_full);
    if (pte.flags & current_usage) protection fault;
    ppn = pte.ppn;
} else {
    size_t vpn[4] = split_bits(vpn_full, 9);
    ppn = PTBR;
    for (int i = 0; i < 4; i += 1) {
        pte = ram[(ppn<<12) | (vpn[i]<<3)];
        if (pte.unmapped) page fault;
        if (pte.flags & current_usage) protection fault;
        ppn = pte.ppn;
    }
    tlb_store(vpn_full, pte);
}
pa = (ppn<<12) | (va & ((1<<12)-1));
For more details on how caches like the TLB work, see Caching.
3.4 Usage
Various uses of virtual memory are alluded to above to motivate the various features of virtual memory; this section focuses on those uses.
3.4.1 Lazy allocation
When a program requests memory or is initially loaded, rather than making the program wait until all the requested space is actually set up, the operating system can let the program proceed right away. Then, if the program tries to access part of that newly allocated space before it is ready, the operating system can finish setting it up in the page fault handler. This "lazy" allocation idea can, for example, help programs start faster.
3.4.2 Page Swapping
Because addresses in code can be mapped to other addresses (with
operating system intervention if needed), virtual memory lets code act
beyond the constraints of available memory. This is traditionally
handled by the operating system storing some pages on the disk instead
of in memory. Moving these pages between disk and memory is called
swapping.
(If the program tries to access pages that are not in
memory, the operating system can trigger swapping from the page fault
handler.)
Because disks are vastly slower than memory, there is a desire to make this happen rarely, meaning that operating systems sometimes devote significant effort to selecting pages to swap out of memory that it believes are likely not to be swapped back in again anytime soon. Predicting which pages will be needed next is undecidable4, so this is always at best a heuristic and sometimes incorrect. If it is consistently incorrect (or there is not enough physical memory available for accuracy to matter) and the process begins spending more time swapping pages in and out than it does doing actual work, it is said that the process is thrashing.
As memory has become less expensive and more plentiful, swap is arguably5 less important than it once was, but it is still a part of every major end-user OS.
3.4.3 Shared Memory
One of the goals of virtual memory is to allow each process to have its own address space, but because this is done by having a mapping between virtual and physical pages, it is quite possible to map the same physical page into multiple different processes’ virtual address spaces—potentially even to have the same physical page mapped to different virtual pages in each process.
Shared memory, as this is called, is used for several purposes, notably including the following:
- Shared libraries. If several programs are going to load the same shared library code, it can be more efficient to have that code in just one physical page shared in each process's virtual address space. Not all libraries are linked in as shared memory: since shared memory must come in full-page chunks, shared-memory library linkage can be inefficient if library code is small or not shared by several running processes. Additionally, sharing library code requires all processes that use it to use the exact same version of the code; as library code is frequently updated, this can itself be a challenge to engineer around. Static library code is linked during compilation, while shared (also called "dynamic") library code is linked by the loader or in some cases even part-way through execution.
- Kernel memory. Consider the page of memory that contains the instruction that changes the PTBR. For the next instruction to be fetched properly, that instruction's virtual page needs to map to the same physical page in both the old and new page tables. Different operating systems handle this differently; in Linux, for example, half of the entire address space (all with addresses 0x800000000000 or above) is reserved for the OS and is mapped to the same set of physical pages in every process's page table.
- Shared-memory communication. Processes can communicate with one another by each reading what others have written into a shared region of memory. This can be a very efficient way of sharing large quantities of data but does not, by itself, provide mechanisms for efficiently awaiting a response.
- Memory-mapped files. As an alternative to reading and writing a file with system calls, many operating systems allow a process to "map" a file into its memory. The bytes of the file then appear as bytes in that process's memory. The operating system can ensure that all processes share a single copy of the file's data, so the bytes always reflect the current version of the file (a sketch of this interface follows this list).
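Here is that sketch, mapping an entire (hypothetical) file read-only with the POSIX mmap interface and printing it:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    int fd = open("example.txt", O_RDONLY); // hypothetical file name
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);
    // the file's bytes now appear as ordinary memory in this process
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }
    fwrite(data, 1, st.st_size, stdout); // print the whole file
    munmap(data, st.st_size);
    close(fd);
}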
3.4.4 Process Switching
Switching between processes happens in the operating system’s code. When the OS was invoked (in an Exception Handler), the state of the original process was stored somewhere; when changing processes,
- the saved user state is moved into an OS data structure to track the not-currently-running processes.
- the PTBR is changed to point to the page table of the new process, and the user page table entries in the translation lookaside buffer are marked as invalid.
- the saved state from the OS data structure is moved back into the saved user state location to be restored when the exception handler returns.
As this happens in operating system software, other computation is often attached to it. It is not uncommon for the operating system to perform many different bookkeeping and other operations each time a context switch is performed.
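As a sketch of the three steps above (every name here is invented, and real kernels track much more state, such as floating-point registers and kernel stacks):

#include <stddef.h>

struct saved_registers { size_t pc, sp, regs[16]; }; // simplified register file

struct process {
    struct saved_registers saved_state; // user state while not running
    size_t page_table_base;             // physical address of top-level page table
};

// hypothetical hooks into the hardware and the exception-entry code
extern struct saved_registers *hardware_saved_state;
extern void set_ptbr(size_t pa);
extern void flush_user_tlb_entries(void);

void context_switch(struct process *old, struct process *next) {
    old->saved_state = *hardware_saved_state;  // step 1: stash the old user state
    set_ptbr(next->page_table_base);           // step 2: switch address spaces...
    flush_user_tlb_entries();                  // ...and drop stale translations
    *hardware_saved_state = next->saved_state; // step 3: restored at handler return
}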
3.4.5 Direct Memory Access
Physical memory pages can be given to systems other than software processes. It is common for various input/output operations to be performed by allocating a page (or more) of memory to be accessed directly by the I/O systems. This lets I/O happen at the slower pace of the I/O systems, concurrently with the processor doing other work, and lets the processor access the data at the higher speed of memory once the transfer into or out of memory is complete. This model is called direct memory access (DMA).
Technically, DMA does not require virtual memory—disks could be given access to physical memory even if code accesses memory with physical addresses—but it is generally implemented as an extension of the virtual memory system.
4 Further resources
4.1 General
- Arpaci-Dusseau and Arpaci-Dusseau, Operating Systems: Three Easy Pieces chapters 4, 6, 13, 18, 36. More detailed textbook on all these topics, intended in the context of a more dedicated operating systems course.
4.2 For signal handling
- The POSIX documentation for sigaction; the POSIX documentation for signal handling generally
4.3 Operating system organization examples
Sections 1.3–1.6 of Bovet and Cesati, Understanding the Linux Kernel, Third Edition, available through O’Reilly (click institution not listed and enter your UVa email)
Chapter 2 of McKusick and Neville-Neil, The Design and Implementation of the FreeBSD Operating System (also available through O’Reilly)
Chapter 2 of Yosifovich, Ionescu, Russinovich, and Solomon, Windows Internals Seventh Edition Part 1 (Also available through O’Reilly)
4.4 Virtual memory system examples
Levy and Lipman (1982),
Virtual Memory Management in the VAX/VMS Operating System
in IEEE Computer. An old virtual memory design that looks very similar to those of today.x86-64 paging hardware
OSdev wiki's (hobbyist OS developer resource) description
The complete but difficult to read references: Volume 3, Chapters 3-4 of Intel’s x86 manuals or Volume 2, Chapters 4-5 of AMD’s x86 manuals. Describes the virtual memory system actually implemented by x86-64. This is made much more complicated because 32-bit x86 supports segmentation and paging simultaneously. In segmentation, rather than dividing up memory into many fixed-sized pieces, it is divided up into fewer, variable-sized pieces. Most operating systems on x86-64 mostly avoid using segmentation but use paging extensively.
Sometimes, you'll see these modes referred to by different names. For example:
- Exception Level 0 (user mode) and Exception Level 1 (kernel mode, roughly) in the ARM manual;
- privilege level 0 or ring 0 (kernel mode) and privilege level 3 or ring 3 (user mode) in the Intel x86-64 manuals;
- user mode and supervisor mode in RISC-V.
As some of these names suggest, some systems support additional modes of this type. These additional modes can allow for finer-grained restrictions on what operations are allowed.
Early proposals for these modes (e.g., Michael D. Schroeder and Jerome H. Saltzer, A Hardware Architecture for Implementing Protection Rings, 1972) often proposed supporting around eight privilege levels, called "rings" (with the innermost rings being more privileged than the outer ones). Despite some hardware implementing these designs, most operating systems only found uses for two modes, the lowest (kernel mode) and the highest (user mode).↩︎
or one per processor, depending on the multiprocessor design↩︎
Note, however, that keeping with the tradition of using the same English word to mean multiple unrelated technical concepts, x86-64 uses the word
segment
in a distinct hardware sense to meanthe high-order bits of addresses we don’t want to express in full.
↩︎In case you have not yet had the pleasure of studying computational theory—or (even more sad to contemplate!) have forgotten important bits of it—
undecidable
basically meansonly solvable in special cases, not in general.
In the happier case where you do remember computational theory and were hoping to find a proof here, Rice’s Theorem is sufficient to demonstrate the undecidability of determining what pages will be needed in what order.↩︎
Arguably as in, I have heard friendly arguments between people who know more about this than I do on this topic.↩︎