(Last update 22 October: changelog)
Checkpoint due: Saturday October 15, 11:59PM
Full submission due: Tuesday October 25, 11:59PM
In this homework, you will use the gem5 simulator to explore how the performance and behavior of an out-of-order processor changes as various implementation factors are adjusted.
We recommend a 64-bit Linux machine or virtual machine to do this assignment. Here is such a virtual machine image, suitable for importing into VirtualBox. (It does not have any software installed on it beyond the Ubuntu defaults.) That VM image has a user called vmuser whose password is password. If you use the supplied VM and have the RAM, I would suggest increasing the amount of memory allocated to it. Alternately, you might consider using vagrant and its ubuntu64/xenial `box’ (virtual machine template).
If you’d like to try to install this natively rather than in a 64-bit Linux virtual machine, we provide installation instructions below which should work on many Unix-like systems. If you’d like to run it without a virtual machine on Windows, this should also be possible. In the event of technical difficulties, however, our recommendation is going to be to use a Linux virtual machine.
After obtaining a suitable environment, download a prebuilt copy of gem5 (which we have patched with this patch) and of our benchmarks. For this assignment, you should have some familiarity with the Linux command line. If you do not, see the brief guide below.
To run a 64-bit virtual machine in VirtualBox, you may need to enable VT-x, which is Intel’s name for extra hardware support for virtual machines. This is sometimes also called names like Intel Virtualization Technology. For reasons I do not understand, laptops and desktops often ship with this feature disabled by default. If it is disabled, you can typically re-enable it in the BIOS or `the Setup Utility’. For example, on my laptop (a Thinkpad T460), I did this by pressing Enter while the machine was booting, selecting the Setup tool from the resulting prompt, then from the resulting menu selecting Config, then Security, then Virtualization. There, I had an option to enable or disable Intel Virtualization Technology.
. Instructions for how to do this will vary between machines, so I cannot give a universal guide here. But other common ways of entering Setup include pressing F12 or Del while booting, and you can probably find instructions online given the model of your laptop or desktop. If you have trouble figuring out how to do this on your system, please ask the course staff for assistance.
Fighting against Linux or virtual machine software is not an objective of this homework. If you have trouble with such issues, please don’t hesitate to ask us.
Submit a zip or tar archive containing:
For the checkpoint, you need only supply the tasks labelled (required for checkpoint)
below. For the final submission, you must do all tasks.
After installing gem5, download our benchmarks directory. This includes several benchmarks and their source code, as well as a script to run a gem5 simulation called `gem5script.py’. The gem5 executable expects to be passed a Python script like this, which is responsible for configuring the simulation environment. The script we have supplied is suitable for you to modify for this assignment.
To start out, first build the benchmark programs using the supplied Makefile by running the make command from the command line. Try running the blocked-matmul program outside of the simulator using
./blocked-matmul
This program performs an 84x84 register-blocked matrix multiply and times it four times. You can read its source code in blocked-matmul.c. Now let us run this program under the processor simulator using the supplied script.
First, modify gem5script.py to point to where you downloaded gem5. At the top of the script is a line like
GEM_DIR = '../gem5'
Either place your downloaded copy of gem5 in the same directory that contains the benchmarks directory, or modify this line to set GEM_DIR to the location of your copy of gem5 (that you downloaded or built). Then find the location of the gem5.opt binary for your copy of gem5, which is in build/X86/gem5.opt. I suggest creating a symbolic link to this binary using a command like:
ln -s PATH/TO/GEM5_DIR/build/X86/gem5.opt ./
This command creates an alias (a symbolic link) for PATH/TO/GEM5_DIR/build/X86/gem5.opt in the current directory. Further instructions in this homework assume this alias exists and run gem5 as ./gem5.opt.
Note that you will need to refer to the gem5 source code repeatedly throughout this assignment to find out about what options are supported. The source code is included with our prebuilt version of gem5 for this reason.
Then run gem5script.py with gem5.opt, pointing it to the blocked-matmul program:
./gem5.opt gem5script.py \
--cmd=./blocked-matmul \
--directory=blocked-matmul-output
Some notes on this simulation:
It runs much slower than the original blocked-matmul program; around ten thousand times slower. This is the primary reason that our benchmark programs are `toy’ programs in this assignment. For real
architecture research, you would use larger benchmarks, but this would involve waiting much longer for simulation results.
The program run is an ordinary user-space program, but the simulated processor does not have an OS. This is gem5’s system call emulation (SE) mode. It acts like a normal processor except that there is no virtual memory, and system call instructions magically do what the operating system would do rather than triggering an exception and calling the operating system. gem5 also supports a full system (FS) mode, which can boot some real operating systems; we will not be using it in this assignment.
The simulator will output some spurious warning messages like:
info: Entering event queue @ 0. Starting simulation...
warn: ignoring syscall access(140737352001731, ...)
warn: ignoring syscall access(140737352012752, ...)
warn: ignoring syscall access(140737352001731, ...)
warn: ignoring syscall mprotect(140737349496832, ...)
warn: x86 cpuid family 0x0000: unimplemented function 7
warn: ignoring syscall mprotect(140737351593984, ...)
warn: ignoring syscall mprotect(6295552, ...)
warn: ignoring syscall mprotect(140737354121216, ...)
These are harmless.
The simulator outputs a message like:
Done simulation @ tick = 939334500: target called exit()
This indicates which simulation tick the program completed on. By default, each simulation tick represents 1 picosecond of simulation time, and the simulated CPU has a clock rate of 2 GHz (500 ticks per clock cycle), so this simulation represents around 939 million ticks / (500 ticks / clock cycle) ≈ 1.9 million clock cycles of simulated time.
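If it helps, here is a small Python sketch of that conversion (the tick count is the one from the message above; 1 ps per tick and a 2 GHz clock are the defaults just described):
# Convert gem5 ticks to simulated cycles and seconds, assuming the
# default 1 ps per tick and a 2 GHz simulated clock (500 ticks per cycle).
ticks = 939334500
ticks_per_cycle = 500
cycles = ticks / ticks_per_cycle   # about 1.88 million cycles
seconds = ticks * 1e-12            # about 0.94 milliseconds of simulated time
print(cycles, seconds)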
The more important outputs from the simulation are in the blocked-matmul-output directory specified via the --directory option. This will contain the following files:
config.ini, config.json: contain the full configuration of the components of the simulation.
program.err, program.out: contain the output of the program. From program.out, you can read the amount of simulated time that blocked-matmul took.
stats.txt: contains numerous statistics from the simulation. Interesting statistics include:
sim_seconds: simulation time
system.cpu.ipc: instructions per cycle achieved by the simulated CPU
system.cpu.commit.loads, system.cpu.commit.branches, etc.: number of loads, branches, etc. finished
system.cpu.iew.exec_branches: number of branches executed (regardless of whether taken), including ones only executed because of a misprediction of a prior branch
system.cpu.iew.branchMispredicts: number of branch mispredictions
system.cpu.iq.fu_full::IntALU: number of times a reservation buffer for integer ALU operations was not available but could have been used
system.cpu.dcache.overall_miss_rate::cpu.data: data cache miss rate
gem5script.py takes some parameters that affect the simulation. For example, you can adjust the data cache size with the --l1d_size option. Try running:
./gem5.opt gem5script.py \
--cmd=./blocked-matmul \
--directory=blocked-matmul-output-hugecache \
--l1d_size=2MB
[The \ represents a line continuation. This indicates that this is meant to all be one command, even though there are newlines in the middle of it.]
Notice that the value of system.cpu.dcache.overall_miss_rate::cpu.data in stats.txt from this simulation is much lower than in the first simulation.
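One quick way to compare the two runs (assuming the two output directory names used above) is a command like:
grep 'overall_miss_rate::cpu.data' blocked-matmul-output/stats.txt blocked-matmul-output-hugecache/stats.txt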
Note that the simulator takes on the order of a minute to simulate 5 milliseconds of simulated runtime, a slowdown of around 10000 times. For this reason, we will generally be only running very short benchmark programs in this assignment.
To supply command-line arguments to a program you run under gem5, you can pass the --options
option, for example:
./gem5.opt gem5script.py \
--cmd=./queens \
--directory=queens-default-output \
--options='-c 10'
runs the command queens -c 10
in the simulator.
This is handled by the code in the create_process() function in gem5script.py.
When supplying filenames as arguments, you should generally provide the full path. The program will be run inside the output directory instead of in your current working directory. This is why we suggest using realpath below.
Rather than vary cache parameters, you will primarily be varying parameters of the simulated out-of-order CPU. Most of these parameters don’t have convenient command-line options. Instead, you will need to modify the create_cpu() function in gem5script.py to change the parameters as you choose. A comment before this function shows some examples of how to modify parameters, including some which are more complicated to set. (The original version of the benchmarks archive also has an error in the first paragraph of that comment, claiming the function varies the number of reorder buffer entries by default as described below, but it does not as supplied.) If you aren’t familiar with Python, note that text between triple quotes (""") in this file is effectively a comment, including several paragraphs of text before def create_cpu.
You should edit parameters by modifying gem5script.py
’s create_cpu
function, around where there is a comment reading YOUR CUSTOMIZATION CODE HERE
. The code you add can look like:
the_cpu.numROBEntries = 100
to run with 100 ROB (reorder buffer) entries instead of the default. For a full list of parameters in the out-of-order CPU model, look at the gem5 source file src/cpu/o3/O3CPU.py
.
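As a slightly larger sketch, the customization code might look like the following (these parameter names appear in src/cpu/o3/O3CPU.py; the specific values are only examples, not recommended settings):
the_cpu.numROBEntries = 100   # reorder buffer entries
the_cpu.fetchWidth = 4        # instructions fetched per cycle
the_cpu.commitWidth = 4       # instructions committed per cycle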
You can also edit the get_options()
function to make the script support additional command-line options which you can access via the options
parameter to the create_cpu()
function. This can allow you to avoid changing the python file every time you want to run a different simulation.
For example, if you add code like:
parser.add_option('--vary', type=str, default='none')
you could do something like
if options.vary == 'rob100':
    the_cpu.numROBEntries = 100
elif options.vary == 'none':
    pass # python for do nothing
else:
    eprint("ERROR: unrecognized --vary option")
    sys.exit(1)
to let you run
./gem5.opt gem5script.py \
--cmd=./blocked-matmul \
--directory=blocked-matmul-output-rob100 \
--vary=rob100
to try 100 ROB entries outputting to blocked-matmul-output-rob100
and:
./gem5.opt gem5script.py \
--cmd=./blocked-matmul \
--directory=blocked-matmul-output-default
to try with the default configuration.
The simulated processor executes instructions in several stages, which are important to understand to reason about the statistics reported:
fetch: reads instructions from the instruction cache; the number fetched per cycle is limited by the fetchWidth parameter. This stage does branch prediction and branch target prediction to determine what to fetch.
decode: the number of instructions decoded per cycle is limited by the decodeWidth parameter.
rename: the number of instructions renamed per cycle is limited by the renameWidth parameter.
dispatch: the number of instructions sent to the instruction queue per cycle is limited by the dispatchWidth parameter.
writeback: the number of results written back per cycle is limited by the wbWidth parameter.
commit: the number of instructions committed per cycle is limited by the commitWidth parameter.
In the event of branch misprediction, trap, or other speculative execution event, squashing can occur at all stages of this pipeline. When a pending instruction is squashed, it is removed from the instruction queues, reorder buffers, requests to the instruction cache, etc.
The fetch, decode, rename and commit stages process instructions in program order. Other stages process instructions out-of-order based on availability of operands and results.
The simulated processor also lets one configure the latency between many of these stages — how many clock cycles it takes an instruction to pass from one phase to another in the best case.
To deal with the complicated instructions that are very common in X86, like a single instruction that performs a load and an add, the simulated processor splits many instructions into multiple microoperations. Confusingly, it is often not clear whether statistics are referring to microops (and calling them instructions) or real instructions. Generally, statistics about the issue, execute, and writeback phases will always concern microops (even if their descriptions use the word instructions), and statistics about the commit phase will make it clear which are referred to.
I have selected several benchmark programs that should explore a range of demands on the simulated processor. For each program, I have a suggested way to run the program that should take not much more than a minute to simulate each time. If the program takes command-line arguments, you should pass the arguments when using them in simulations as described above.
blocked-matmul: a 2x2 register-blocked matrix multiplication of two 84x84 matrices. The matrices are pseudorandomly generated and all sizes are hard-coded. Source code for this benchmark is in blocked-matmul.c
.
Our suggested command-line for this program is
./blocked-matmul
It takes no command-line arguments.
This program was selected because it should have a mix of cache accesses and floating point operations.
BFS: computes a breadth-first search problem. This is taken from the Problem Based Benchmark Suite. Source code for this benchmark, along with utilities for generating graph data is in the breadthFirstSearch
directory.
We supply some example graphs in the inputs directory. Our suggested command-line for this program is
./BFS path/to/RL3k.graph
where path/to/RL3k.graph is the full path to the RL3k.graph file supplied in the inputs directory of the benchmarks archive and via this link. You may get this with a command like:
realpath inputs/RL3k.graph
This program was selected because it should have poor data cache locality.
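Putting this together, a full BFS run might look like the following (the output directory name here is just an example; you can also paste the output of realpath directly instead of using the $(...) substitution):
./gem5.opt gem5script.py \
    --cmd=./BFS \
    --directory=BFS-default-output \
    --options="$(realpath inputs/RL3k.graph)"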
sha: computes the SHA-1 cryptographic hash of its input. Source code for this benchmark is in the sha-src
directory.
Our suggested command-line for this program is
./sha path/to/example-sha-input.txt
where path/to/example-sha-input.txt is the full path to the example-sha-input.txt supplied in the inputs directory. You may get this with a command like:
realpath inputs/example-sha-input.txt
This program was selected because it should be integer-operation intensive and very friendly for branch prediction and cache.
queens: solves the N queens problem for an N specified as an argument. This is taken from the LLVM (a compiler toolkit) test suite, but based on comments in the source file queens.c, it is much older. Source code is in the queens.c
file.
Our suggested command-line for this program is
./queens -c 10
The -c
option indicates to count solutions instead of printing out any solutions.
This program was selected because it should be very friendly to the cache, but very challenging for branch prediction.
Test each program with the suggested command-line above and create a table reporting the following with the default settings for the simulated processor:
gem5 includes a pipeline viewer, briefly described here. Try viewing the pipeline during the execution of the blocked-matmul program. First run blocked-matmul with debugging options to record pipelines from simulation ticks 500000000 to 501000000:
./gem5.opt --debug-flags=O3PipeView --debug-start=500000000 --debug-file=./trace.out \
gem5script.py --directory=matmul-trace -c ./blocked-matmul -m 501000000
The -m option specifies to terminate the simulation after running 501000000 ticks. After running this command, you will have the usual output in the matmul-trace directory, and a file called trace.out in that directory. If you look at this file, you will see that it contains a list of instructions, along with the clock tick at which they completed each phase of their execution. If an instruction was speculatively executed and did not complete some phase of its execution, 0 is listed instead.
You can pass this trace.out to gem5’s pipeline viewer to get an easier to read output:
PATH/TO/GEM5/util/o3-pipeview.py --cycle-time=500 --color ./trace.out -o pipeview.out
The --cycle-time option specifies the cycle time (clock period) of the simulated CPU in simulation ticks. This command produces an output file called pipeview.out, which you can view with a command like:
less -r pipeview.out
Note that the output is very wide by default. You can adjust the width of the output by supplying a value for the -w option:
PATH/TO/GEM5/util/o3-pipeview.py -w 40 --cycle-time=500 --color ./trace.out -o pipeview.out
For each instruction, this indicates what the address of the instruction was, the kind of instruction it was (using gem5’s internal name) and how long each phase of its completion took.
Some lines have instruction names shown with a special prefix. These represent instructions which were executed speculatively and `squashed’ (their effects cancelled), for example because a branch was mispredicted.
For this section, use only the BFS program. Notice that it has a relatively high data cache miss rate. The default configuration of the processor not only overlaps cache misses with other computation but can have many outstanding cache misses.
The simulated cache has a number of MSHRs (Miss Status Handling Registers), each of which can make up to one request at a time to the memory system. The number of MSHRs is controlled by the mshrs parameter of each cache object (like the dcache variable produced by the supplied code).
Adjust the number of MSHRs in the data cache to 1 (from the default of 4) for the BFS program. Compare the performance to a higher number of MSHRs, and compare the system.cpu.dcache.blocked_cycles::no_mshrs statistics from each run.
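A minimal sketch of that change, assuming the dcache variable created by the supplied gem5script.py (as mentioned above); add it where the script configures the data cache:
# Restrict the L1 data cache to a single outstanding miss.
dcache.mshrs = 1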
What is the approximate benefit (for simulated runtime) of overlapping cache misses for this program?
Try adjusting the data cache size to decrease the number of cache misses to a negligible number. Based on this, what portion of performance was lost from running the program with the original number of MSHRs versus the data cache not being a factor?
Examine the system.cpu.dcache.overall_miss_latency statistic, which shows the total number of simulation ticks that memory accesses triggered by a cache miss took. Based on this (and any other statistics), estimate the performance gained by the ability to overlap cache misses with other instructions (both other cache accesses and non-cache instructions like integer arithmetic). Identify any major limitations of your estimate.
The overall_miss_latency
statistic counts time when there are two active cache misses twice. For example, if there is a miss at time 1 that takes until time 11 to resolve and a miss at time 5 that takes until time 14, the overall_miss_latency
will be (11-1) + (14-5) = 19.
When making your estimate, note that the simulated processor always overlaps cache misses with other operations, even when configured with one MSHR (in which case it can only handle one outstanding cache miss at a time).
Examine the counts of `squashed’ instructions and memory requests ignored due to squashing.
Based on the counters, estimate what portion of throughput is lost due to branch misprediction for each benchmark program. Identify any major limitations of your estimate.
Based on the rate of branches, estimate the total benefit of the correct branch prediction for each benchmark. Identify any major limitations of your estimate.
Change the branch predictor to the NeverTakenBP we have provided (by setting the_cpu.branchPred = NeverTakenBP()) and run each benchmark again. This branch predictor predicts every conditional branch as not taken, resulting in a very low rate of correct branch prediction. How does the performance difference compare to your estimates?
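If you are using the --vary hook sketched earlier, one way to wire this up inside create_cpu is shown below (the value name 'nevertaken' is arbitrary, and this assumes NeverTakenBP is available in the script as described above):
if options.vary == 'nevertaken':
    # Predict every conditional branch as not taken.
    the_cpu.branchPred = NeverTakenBP()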
Examine the instruction mixes for each benchmark program based on committed microoperations (the system.cpu.commit.op_class_0::
counters). Note that program instructions may be split into multiple microoperations, each of which is executed individually on functional units.
Examine the available functional units in the gem5 source code in src/cpu/o3/FuncUnitConfig.py. In FuncUnitConfig.py, the counts indicate how many of each functional unit are available. Unless otherwise specified with pipelined=False, each group of functional units can be dispatched count instructions per cycle. Unless otherwise specified with the opLat option, each functional unit takes 1 cycle to produce a result after a value is dispatched to it.
Based on the instruction mix and the available functional units, what is the maximum number of microoperations which could be dispatched per cycle? Assume that the mix of instructions is approximately constant throughout each benchmark’s execution.
Based on this, about how close is each program to the maximum possible computation rate for the execute phase?
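As a sketch of the kind of arithmetic involved (all numbers below are made-up placeholders; substitute the op_class fractions from your stats.txt and the counts from FuncUnitConfig.py):
# Hypothetical fraction of committed microops falling in each op class.
mix = {'IntAlu': 0.60, 'MemRead': 0.25, 'MemWrite': 0.15}
# Hypothetical number of functional units that can accept each class per cycle.
units = {'IntAlu': 6, 'MemRead': 4, 'MemWrite': 4}
# The sustainable rate is limited by whichever class saturates its units first.
bound = min(units[c] / frac for c, frac in mix.items())
print(bound)  # maximum microops per cycle implied by this mix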
Note that, by default, the maximum number of instructions (usually actually microops) per cycle for each of these phases is 8.
Consider changing all these widths to 4. Which benchmark seems like it should be most affected and why? Which benchmark seems like it should be least affected and why?
Now try running each benchmark with all these widths changed to 4. How does performance actually change?
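A minimal sketch of that change in create_cpu, hooked to the --vary option from earlier (the value name 'width4' is arbitrary):
if options.vary == 'width4':
    # Reduce each per-cycle width from the default of 8 to 4.
    for name in ('fetchWidth', 'decodeWidth', 'renameWidth',
                 'dispatchWidth', 'wbWidth', 'commitWidth'):
        setattr(the_cpu, name, 4)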
A fundamental limit on out-of-order processors is the dependence limit
— the inability to issue instructions faster than their dependencies can be computed. The processor in question tries to evade the dependence limit somewhat by speculatively executing past unexecuted branches, and speculatively executing memory accesses before it knows if their addresses will conflict, but there is a limit on how far it can go.
Try to run the simulated processor much closer to this dependence limit by increasing all the widths from Part E.3, along with dramatically increasing the number of functional units of each type, the number of reorder buffer entries, the number of physical registers of every type, and the size of the load/store queues. How much do the instructions per cycle increase?
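A partial sketch of the sizing part of this change (the parameter names come from src/cpu/o3/O3CPU.py; the values are arbitrary large choices, and you would still need to enlarge the widths and the functional-unit pool following the example comments in gem5script.py):
the_cpu.numROBEntries = 512     # reorder buffer entries
the_cpu.numIQEntries = 256      # instruction queue entries
the_cpu.numPhysIntRegs = 512    # physical integer registers
the_cpu.numPhysFloatRegs = 512  # physical floating-point registers
the_cpu.LQEntries = 128         # load queue entries
the_cpu.SQEntries = 128         # store queue entries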
Choose one of the following parameters:
For the parameter you choose, look for evidence in benchmark statistics from benchmarks you already ran that changing it is likely to have a small or large effect on performance. What do you find? Do you think it is possible to infer the effects of the parameter from the statistics? Explain why or why not.
Try varying the parameter and running the benchmark programs. How much does performance change? Does this match your expectations?
If you get an error about gettid being unimplemented, check the program.err file. Often this means the program failed and tried to execute an error handling routine that needed to know some process information that gem5 does not simulate properly.
You should not need the instructions in this section if you are using our prebuilt archives on a 64-bit Linux system.
The benchmarks archive includes a Makefile. If you have GNU make and gcc and g++ installed, running make clean, then make, should rebuild all the benchmark programs for your system.
If you are using one of our supplied VMs, you may need to install these packages. One way to do this is sudo apt-get install build-essential
.
If you’d like to build the benchmarks with a different compiler and/or different compiler options, you will need to edit the Makefile, the Makefile in breadthFirstSearch/deterministicBFS, and the Makefile in sha.
These instructions are for building gem5 on a Unix-like system.
To build gem5 from source, first make sure the following dependencies (listed with Ubuntu/Debian package names) are installed (from the video on this page):
Then acquire the gem5 source. You can take our prebuilt package and remove the build
directory. Alternatively, you can check out gem5 from Mercurial yourself using
hg clone http://repo.gem5.org/gem5
and then apply our patch using
patch -p 1 < ./gem5.patch
Then use scons build/X86/gem5.opt
to actually build gem5.
From a default graphical install of Ubuntu, log in, press Alt-F2, and type gnome-terminal to get a command-line window. (Alternately, if using vagrant, vagrant ssh will give you a command-line window in the virtual machine.)
Then, here are some useful commands (adapted from our advice in CS 3330):
pwd: identify the current directory you are in
ls: list files in the current directory
mkdir: create a new directory
man ls: display a manual page for ls. This also works for essentially every other command.
cd dirname: enter the dirname directory
cd ..: enter the parent directory of the current directory; for example, if you are in /home/cr4bd/benchmarks, the parent directory is /home/cr4bd. You can also generally use .. to refer to files in the parent directory; for example, ../foo.txt names a file called foo.txt in the parent directory.
./foo: runs an executable called foo in the current directory.
history: show commands you ran recently.
wget URL: download the file from URL to the current directory.
tar -zxvf FILE.tar.gz: extract the tar archives we provide.
gedit &: open a graphical text editor. Other possible graphical editors include geany and kate. The & indicates to run the command in the background.
nano or pico: open an editor that operates in the terminal (likely necessary if you are SSHing into your VM).
sudo apt install PACKAGE: install a software package. If you’re using Ubuntu 16.04, you can search possible packages on this page, selecting the xenial distribution. The sudo indicates to run the specified command as root. It will require your password. In the VM image we supply, the password is password.
In the terminal, you can use up-arrow and down-arrow to repeat recent commands. If you press TAB, this will try to complete your current command; for example, cd be followed by TAB will probably result in cd benchmarks if a directory named benchmarks exists in the current directory. If there were multiple directories starting with be, then it would not complete to any name, but pressing TAB twice would give you a list of the possible completions.
Changelog:
Noted that you may want to install the VirtualBox guest additions in the VM (sudo apt install virtualbox-guest-dkms).
Changed MST to BFS in Task C. (A draft of this assignment had a different graph-based benchmark program in place of BFS.)
Revised the Varying Simulation Parameters section to not mention things that aren’t in the current example comments before create_cpu.