Assignment: OOO

In this homework, you will use the gem5 simulator to explore how the performance and behavior of an out-of-order processor changes as various implementation factors are adjusted.

Infrastructure

We recommend a 64-bit Linux machine or virtual machine to do this assignment. This is such a virtual machine image suitable for importing into VirtualBox. (It does not have any software installed on it beyond the Ubuntu defaults.) That VM image has a user created called vmuser whose password is password. If you use the supplied VM and have the RAM, I would suggest increasing the amount of memory allocated to it. Alternately, you might consider using vagrant and its ubuntu64/xenial `box’ (virtual machine template).

If you’d like to try to install this natively on a non-64-bit Linux virtual machine, we provide installation instructions below which should work on many Unix-like systems. If you’d like to run it without a virtual machine on Windows, this should also be possible. In the event of technical difficulties, however, our recommendation is going to be to use a Linux virtual machine.

After obtaining a sutiable environment, download a prebuilt copy of gem5 (which we have patched with this patch) and of our benchmarks. For this assignment, you should have some familiarity with the Linux command line. If you do not, see the brief guide below.

VM issues

To run a 64-bit virtual machine in Virtualbox, you may need to enable VT-x, which is Intel’s name for extra hardware support for virtual machines. This is sometimes also called names like “Intel Virtualization Technology”. For reasons I do not understand, often laptops or desktops ship with this feature disabled by default. If it is disabled, you can typically reenable it in BIOS or `the Setup Utility’. For example, on my laptop (a Thinkpad T460), I did this by pressing Enter while the machine was booting, selecting to go into the “Setup tool” from the resulting prompt. Then, from the resulting menu, selecting “Config”, then “Security”, then “Virtualization”. Then, I had an option to enable or disable “Intel Virtualization Technology”. Instructions for how to do this will vary between machines, so I cannot give a universal guide here. But other common ways of entering Setup include pressing F12 or Del while booting, and you can probably find instructions online given the model of your laptop or desktop. If you have trouble figuring out how to do this on your system, please ask the course staff for assistance.

Fighting against Linux or virtual machine software is not an objective of this homework. If you have trouble with such issues, please don’t hesitate to ask us.

Deliverables

Submit a zip or tar archive containing:

For the checkpoint, you need only supply the tasks labelled “(required for checkpoint)” below. For the final submission, you must do all tasks.

gem5 Explanation/Tutorial

An Example Simulation

After installing gem5, download our benchmarks directory. This includes several benchmarks and their source code, as well as a script to run a gem5 simulation called `gem5script.py’. The gem5 executable expects to be passed a Python script like this, which is responsible for configuring the simulation environment. The script we have supplied is suitable for you to modify this assignment.

To start out, first build the benchmark programs using the supplied Makefile by running the make command from the command-line. Try running the blocked-matmul prorgam out of the simulation using

./blocked-matmul

This program performs a 84x84 register-blocked matrix multiply and times it four times. You can read its source code in blocked-matmul.c. Now let us run this program under the processor simulator using the supplied script.

First, modify the gem5script.py to point to where you downloaded GEM5, at the top of script is a line like

GEM_DIR = '../gem5'

Either place your downloaded copy of gem5 in the same directory that contains the benchmarks directory or modify this to set GEM_DIR to the location of your copy of gem5 (that you downloaded or built). Then find the location of the gem5.opt binary for your copy of gem5, which is in build/X86/gem5.opt. I suggest creating a symbolic link to this binary using a command like:

ln -s PATH/TO/GEM5_DIR/build/X86/gem5.opt ./

This command creates an for PATH/TO/GEM5_DIR/build/X86/gem5.opt in the current directory. Further instructions in this homework assume this alias exists by running gem5 with ./gem5.opt.

Note that you will need to refer to the gem5 source code repeatedly throughout this assignment to find out about what options are supported. The source code is included with our prebuilt version of gem5 for this reason.

Then run gem5script.py with gem5.opt, pointing it to the blocked-matmul program:

./gem5.opt gem5script.py \
    --cmd=./blocked-matmul \
    --directory=blocked-matmul-output

Some notes on this simulation:

These are harmless.

The more important outputs from the simulation are in the blocked-matmul-output directory specified via the –directory option. This will contain the following files:

gem5script.py takes some parameters that affect the simulation. For example, you can adjust the data cache size with the --l1d_size option. Try running:

./gem5.opt gem5script.py \
    --cmd=./blocked-matmul \
    --directory=blocked-matmul-output-hugecache \
    --l1d_size=2MB

[The \ represents a line continuation. This indicates that this is meant to all be one command, even though there are newlines in the middle of it.]

Notice that the value of system.cpu.dcache.overall_miss_rate::cpu.data in stats.txt from this simulation is much lower than the first simulation.

Note that the simulator takes on the order of a minute to simulate 5 milliseconds of simulated runtime, a slowdown of around 10000 times. For this reason, we will generally be only running very short benchmark programs in this assignment.

Supplying Command-Line Arguments

To supply command-line arguments to a program you run under gem5, you can pass the --options option, for example:

./gem5.opt gem5script.py \
    --cmd=./queens \
    --directory=queens-default-output \
    --options='-c 10'

runs the command queens -c 10 in the simulator.

This is handled by the code in the create_process() function gem5script.py.

When supplying filenames, you should generally provide the full path to the program. The program will be run inside the output directory instead of in your current working directory. This is why we suggest using realpath above.

Varying Simulation Parameters

Rather than vary cache parmaters, you will be primarily be varying parameters of the simulated out-of-order CPU. Most of these parameters don’t have convenient command-line options. Instead, you will need to modify the create_cpu() function in gemscript.py to change the parameters as you choose. A comment before this function shows some examples of how to modify parameters, including some which are more complicated to set. (In the original version of the benchmarks archive, it also has an error in the first paragraph, claiming the function varies the number of reorder buffers by default like is described below, but it does not as supplied.) If you aren’t familar with Python note that that text between triple quotes (""") in this file are effectively comments, including several paragraphs of text before def create_cpu.

You should edit parameters by modifying gem5script.py’s create_cpu function, around where there is a comment reading YOUR CUSTOMIZATION CODE HERE. The code you add can look like:

the_cpu.numROBEntries = 100

to run with 100 ROB (reorder buffer) entries instead of the default. For a full list of parameters in the out-of-order CPU model, look at the gem5 source file src/cpu/o3/O3CPU.py.

You can also edit the get_options() function to make the script support additional command-line options which you can access via the ‘options’ parameter to the create_cpu() function. This can allow you to avoid changing the python file every time you want to run a different simulation.

For example, if you add code like:

parser.add_option('--vary', type=str, default='none')

you could do something like

if the_cpu.vary == 'numROB100':
    the_cpu.numROBEntries = 100
elif the_cpu.vary == 'none':
    pass # python for do nothing
else:
    eprint("ERROR: unrecgonized --vary option")
    sys.exit(1)

to let you run

./gem5.opt gem5script.py \
    --cmd=./blocked-matmul \
    --directory=blocked-matmul-output-rob100 \
    --vary=rob100

to try 100 ROB entries outputting to blocked-matmul-output-rob100 and:

./gem5.opt gem5script.py \
    --cmd=./blocked-matmul \
    --directory=blocked-matmul-output-default

to try with the default configuration.

Stages of the Simulated Processor

The simulated processor executes instructions in several stages, which are important to understand to reason about the statistics reported:

In the event of branch misprediction, trap, or other speculative execution event, “squashing” can occur at all stages of this pipeline. When a pending instruction is squashed, it is removed from the instruction queues, reorder buffers, requests to the instruction cache, etc.

The fetch, decode, rename and commit stages process instructions in program order. Other stages process instructions out-of-order based on availability of operands and results.

The simulated processor also lets one configure the latency between many of these stages — how many clock cycles it takes an instruction to pass from one phase to another in the best case.

To deal with the complicated instructions that are very common in X86, like a single instruction that performs a load and an add, the simulated processor splits many instructions into multiple ‘microoperations’. Confusingly, it is often not clear whether statistics are referring to microops (and calling them instructoiuns) or real instructions. Generally, statistics about the issue, execute, and writeback phase will always concern microops (even if their descriptions use the word “instructions”), and statistics about the commit phase will make it clear which are referred to.

Some Terminology in the Program Statistics

Supplied Benchmark Programs

I have selected several benchmark programs that should explore a range of demands on the simulated processor. For each program, I have a suggested way to run the program that should take not much more than a minute to simulate each time. If the program takes command-line arguments, you should pass the arguments when using them in simulations as described above.

Tasks

Part A: General attributes of benchmark programs (required for checkpoint)

Test each program with the suggested command-line above and create a table reporting the following with the default settings for the simulated processor:

Part B: Use the pipeline viewer (required for checkpoint)

gem5 includes a pipeline viewer, briefly described here. Try viewing the pipeline during the execution of the blocked-matmul program. First run blocked-matmul with debugging options to record pipelines from simulation ticks 500000000 to 501000000:

./gem5.opt --debug-flags=O3PipeView --debug-start=500000000 --debug-file=.`/trace.out \
    gemscript.py --directory=matmul-trace -c ./blocked-matmul -m 501000000

The -m option specifies to terminate the simulation after running 501000000 ticks. After running this command, you will have the usual output the matmul-trace directory, and file called trace.out in that directory. If you look at this file, you will see that it cotnains a list of instructions, along with the clock tick in which they completed each phase of their execution. If an instruction was speculatively executed and did not complete some phase of its execution, ‘0’ is listed instead.

You can pass this trace.out to gem5’s pipeline viewer to get an easier to read output:

PATH/TO/GEM5/util/o3-pipeview.py --cycle-time=500 --color ./trace.out -o pipeview.out

The –cycle-time option specifies the clock rate of the simulated CPU in simulation ticks. This command produces an output file call pipeview.out, which you can view with a command like:

less -r pipeview.out

Note that the output is very wide by default. You can adjust the width of the output by supplying a value for the -w option:

PATH/TO/GEM5/util/o3-pipeview.py -w 40 --cycle-time=500 --color ./trace.out -o pipeview.out

For each instruction, this indicates what the address of the instruction was, the kind of instruction it was (using gem5’s internal name) and how long each phase of its completion took.

Some lines have instruction names prefixed with ‘-----’. These represent instructions which were executed speculatively and `squashed’ (its effects cancelled), for example because the branch was mispredicted.

Part C: Effective cache miss penalty versus miss latency (required for checkpoint)

For this section, use only the BFS program. Notice that it has a relatively high data cache miss rate. The default configuration of the processor not only overlaps cache misses with other comptuation but can have many outstanding cache misses.

The simulated cache has a number of MSHR (Miss Status Handling Registers) that each can make up to one request at a time to the memory system. The number of MSHRs is controlled by the mshrs parameter of each cache object (like the dcache variable produced by the supplied code).

  1. Adjusting the number of MSHRs in the data cache to 1 (from the default of 8 4) for the BFS program. Compare the performance to a higher number of MSHRs and the system.cpu.dcache.blocked_cycles::no_mshrs statistics from each run.

    What was the difference in performance?

  2. Try adjusting the data cache size to decrease the number of cache misses to a neglible number. What was the difference in performance?

  3. Examine the system.cpu.dcache.overall_miss_latency statistic, which shows the total number of simulation ticks memory accesses triggered by a cache miss took. Based on this (and any other statistics), estimate the performance gained by the ability to overlap cache misses with other instructions (both other cache acceses and non-cache instructions like integer arithmetic). Identify any major limitations of your estimate.

    The overall_miss_latency statistic counts time when there are two active cache misses twice. For example, if there is a miss at time 1 that takes until time 11 to resolve and a miss at time 5 that takes until time 14, the overall_miss_latency will be (11-1) + (14-5) = 19.

    When making your estimate, note that the simulated processor always overlaps cache misses with other operations, even when configured with one MSHR, so it can only handle one cache miss at a time.

Part D: Branch Prediction Benefits

Examine the counts of `squashed’ instructions and memory requests ignored due to squashing.

  1. Based on the counters, estimate what portion of throughput is lost due to branch misprediction for each benchmark program. Identify any major limitations of your estimate.

  2. Based on the rate of branches, estimate the total benefit of the correct branch prediction for each benchmark. Identify any major limitations of your estimate.

  3. Change the branch predictor to the NeverTakenBP we have provided (by setting the_cpu.branchPred = NeverTakenBP()) and run each benchmark again. This branch predictor predicts every conditional branch as not taken resulting in a very low rate of correct branch prediction. How does the performance difference compare to your estimates?

Part E: Achievable bandwidth

Examine the instruction mixes for each benchmark program based on committed microoperations (the system.cpu.commit.op_class_0:: counters). Note that program instructions may be split into multiple microoperations each of which is executed indiivdually on functional units.

Examine the available functional units in the gem5 source code in src/o3/FuncUnitConfig.py. In FuncUnitConfig.py, the counts indicate the “width” of each functional unit available. Unless otherwise specified with the pipelined=False, each functional unit can dispatch count instructions per cycle. Unless specified with the opLat option, each functional unit takes 1 cycle to produce a result after a value is dispatched to it.

  1. Based on the instruction mix and the available functional units, if the other stages were not bottlenecks, what is the maximum average number of microoperations which could be dispatched per cycle? Assume that the mix of instructions is approximately constant throughout each benchmark’s execution.

  2. How close is each program to the maximum possible average computation rate it could achieve given its instructions and the available functional units?

  3. Computation speed is also limited by the widths of the non-execute phases. Examine the counters and histograms in the stats.txt files for:
    • instructions fetched per cycle
    • instructions decoded per cycle
    • instructions renamed per cycle
    • instructions dispatched per cycle
    • instructions issued per cycle
    • instructions committed per cycle
    • instructions squashed per cycle (when undoing a mispredicted branch in the commit phase)

    Note that, by default, each of the maximum number of instructions (usually actually microops) per cycle for each of these phases is 8.

    Consider changing all these widths to 4. Which benchmark seems like it should be most affected and why? Which benchmark seems like it should be least affected and why?

  4. Now try running each benchmark with all these widths changed to 4. How does performance actually change?

  5. A fundamental limit on out-of-order processors is the “dependence limit” — the inability to issue instructions faster than their dependencies can be computed. The processor in question tries to evade the dependence limit somewhat by speculatively executing past unexecuted branches, and speculatively executing memory accesses before it knows if their addresses will conflict, but there is a limit on how far it can go.

    Try to run the simulated processor much closer to this dependence limit by increasing all of

    • all the widths from Part E.3
    • the number of functional units of each type
    • the number of reorder buffer entries
    • the number of physical registers of every type, and
    • the size of the load/store queues. How much does the instructions per cycle increase by?

Part F: Miscellaneous Resource

Choose two of following parameters:

a. the latency of the available functional units; b. the size of the reorder buffer; c. the size of the instruction queue; d. the sizes of the load and store queues; e. the number of physical registers (internal register names); f. the size of the branch predictors;

  1. For each parameter you choose, look for evidence in benchmark statistics from benchmarks you already ran that changing it is likely to have a small or large effect on performance. What do you find? Do you think it is possible to infer the effects of the parameter from the statistics? Explain why or why not.

  2. Try varying the parameter and running the benchmark programs. How much does performance change? Does this match your expectations?

Note on errors

If you get an error about gettid being unimplemented, check the program.err file. Often this means the program failed and tried to execute an error handling routine that needed to know some process information that gem5 does not simulate properly.

Building benchmarks and gem5

You should not need the instructions in the section if you are using our prebuilt archives on a 64-bit Linux system.

Rebuilding the benchmarks

The benchmarks includes a Makefile. If you have GNU make and gcc and g++ installed, running make clean, then make should rebuild all the benchmark programs for your system.

If you are using one of our supplied VMs, you may need to install these packages. One way to do this is sudo apt-get install build-essential.

If you’d like to build the benchmarks with a different compiler and/or differnt compiler options, you will need to edit the Makefile and the Makefile in breadthFirstSearch/deterministicBFS and the Makefile in sha.

Rebuilding gem5

These instructions are for building gem5 on a Unix-like system.

To build gem5 from source, first make sure the following dependencies (listed with Ubuntu/Debian package names) are installed (from the video on this page):

Then acquire the gem5 source. You can take our prebuilt package and remove the build directory. Alternatively, you can checkout gem5 from mercurial yourself using

hg clone http://repo.gem5.org/gem5

and then apply our patch using

patch -p 1 < ./gem5.patch

Then use scons build/X86/gem5.opt to actually build the gem5 directory.

Linux quick start

From a default graphical install of Ubuntu, log in and use alt-F2 and type gnome-terminal to get a command line window. (Alternately, if using vagrant, vagrant ssh will give you a command-line window in the virtual machine.)

Then, here are some useful commands (adapted from our advice in CS 3330):

In the terminal, you can use up-arrow and down-arrow to repeat recent commands. If you press TAB, this will try to complete your current command, for example cd beTAB will probably result in cd benchmarks if a directory named benchmarks exists in the current directory. If there were multiple directories starting with be, then it would not complete to any name, but pressing TAB twice would give you a list of the possible completions.