Assignment: OOO
- Infrastructure
- VM issues
- Deliverables
- gem5 Explanation/Tutorial
- Supplied Benchmark Programs
- Tasks
- Part A: General attributes of benchmark programs (required for checkpoint)
- Part B: Use the pipeline viewer (required for checkpoint)
- Part C: Effective cache miss penalty versus miss latency (required for checkpoint)
- Part D: Branch Prediction Benefits
- Part E: Achievable bandwidth
- Part F: Miscellaneous Resource
- Note on errors
- Building benchmarks and gem5
- Linux quick start
In this homework, you will use the gem5 simulator to explore how the performance and behavior of an out-of-order processor changes as various implementation factors are adjusted.
Infrastructure
- prebuilt gem5
- patch to original gem5 source (if you’re building it yourself)
- prebuilt benchmarks
- sample inputs to the benchmarks are included in the archive (as of 6 October; apologies for the error) and available separately
- VirtualBox compatible 64-bit Ubuntu 16.04 image with no software preinstalled. You should be able to use almost any 64-bit x86 Linux installation; this is just provided as an example. (If you want a more minimal, faster GUI-less VM, a vagrant box (mentioned below) will do it.)
We recommend a 64-bit Linux machine or virtual machine for
this assignment. The image above is a virtual machine image
suitable for importing into VirtualBox. (It does not
have any software installed on it beyond the Ubuntu defaults.) That VM image
has a user called vmuser
whose password is password
. If you use the supplied
VM and have the RAM, I would suggest increasing the amount of memory allocated to it.
Alternately, you might consider using vagrant
and its ubuntu/xenial64 ‘box’ (virtual machine template).
If you’d like to install this natively rather than in a virtual machine, we provide installation instructions below which should work on many Unix-like systems. If you’d like to run it without a virtual machine on Windows, this should also be possible. In the event of technical difficulties, however, our recommendation is going to be to use a Linux virtual machine.
After obtaining a suitable environment, download a prebuilt copy of gem5 (which we have patched with this patch) and of our benchmarks. For this assignment, you should have some familiarity with the Linux command line. If you do not, see the brief guide below.
VM issues
To run a 64-bit virtual machine in VirtualBox, you may need to enable VT-x, Intel’s name for its extra hardware support for virtual machines. This is sometimes also called names like “Intel Virtualization Technology”. For reasons I do not understand, laptops and desktops often ship with this feature disabled by default. If it is disabled, you can typically re-enable it in the BIOS or ‘Setup Utility’. For example, on my laptop (a Thinkpad T460), I did this by pressing Enter while the machine was booting and choosing the “Setup tool” from the resulting prompt. From the resulting menu, I selected “Config”, then “Security”, then “Virtualization”, which gave me an option to enable or disable “Intel Virtualization Technology”. Instructions for how to do this will vary between machines, so I cannot give a universal guide here. But other common ways of entering Setup include pressing F12 or Del while booting, and you can probably find instructions online given the model of your laptop or desktop. If you have trouble figuring out how to do this on your system, please ask the course staff for assistance.
Fighting against Linux or virtual machine software is not an objective of this homework. If you have trouble with such issues, please don’t hesitate to ask us.
Deliverables
Submit a zip or tar archive containing:
- an HTML, PDF, or text document containing answers to questions in and explanations required by the Tasks listed below;
- supporting simulation results (raw data files)
For the checkpoint, you need only supply the tasks labelled “(required for checkpoint)” below. For the final submission, you must do all tasks.
gem5 Explanation/Tutorial
An Example Simulation
After installing gem5, download our benchmarks directory. This includes several benchmarks and their source code, as well as a script to run a gem5 simulation called `gem5script.py’. The gem5 executable expects to be passed a Python script like this, which is responsible for configuring the simulation environment. The script we have supplied is suitable for you to modify for this assignment.
To start out, first build the benchmark programs using the supplied Makefile
by running the make
command from the command-line. Try running the blocked-matmul
prorgam out of the simulation using
./blocked-matmul
This program performs a 84x84 register-blocked matrix multiply and times it four times. You can read its source code in blocked-matmul.c. Now let us run this program under the processor simulator using the supplied script.
First, modify gem5script.py
to point to where you downloaded gem5. At the top of the
script is a line like
GEM_DIR = '../gem5'
Either place your downloaded copy of gem5 in the same directory that contains the benchmarks
directory or modify this to set GEM_DIR
to the location of your copy of gem5 (that you downloaded
or built). Then find the location of the gem5.opt binary for your copy of gem5, which
is in build/X86/gem5.opt. I suggest creating a symbolic link to this binary using
a command like:
ln -s PATH/TO/GEM5_DIR/build/X86/gem5.opt ./
This command creates a symbolic link to PATH/TO/GEM5_DIR/build/X86/gem5.opt
in the current directory.
Further instructions in this homework assume this link exists and will run gem5 as ./gem5.opt.
Note that you will need to refer to the gem5 source code repeatedly throughout this assignment to find out about what options are supported. The source code is included with our prebuilt version of gem5 for this reason.
Then run gem5script.py with gem5.opt, pointing it to the blocked-matmul program:
./gem5.opt gem5script.py \
--cmd=./blocked-matmul \
--directory=blocked-matmul-output
Some notes on this simulation:
-
It runs much slower than the original blocked-matmul program; around ten thousand times slower. This is the primary reason that our benchmark programs are `toy’ programs in this assignment. For “real” architecture research, you would use larger benchmarks, but this would involve waiting much longer for simulation results.
-
The program run is an ordinary user-space program, but the simulated processor does not have an OS. This is gem5’s ‘system call emulation’ (SE) mode. This acts like a simulated processor except that there is no virtual memory, and the system call instructions magically do what the operating system would do rather than triggering an exception and calling the operating system. gem5 also supports a ‘full system’ (FS) mode, which can boot some real operating systems, which we will not be using in this assignment.
-
The simulator will output some spurious warning messages like:
info: Entering event queue @ 0. Starting simulation...
warn: ignoring syscall access(140737352001731, ...)
warn: ignoring syscall access(140737352012752, ...)
warn: ignoring syscall access(140737352001731, ...)
warn: ignoring syscall mprotect(140737349496832, ...)
warn: x86 cpuid family 0x0000: unimplemented function 7
warn: ignoring syscall mprotect(140737351593984, ...)
warn: ignoring syscall mprotect(6295552, ...)
warn: ignoring syscall mprotect(140737354121216, ...)
These are harmless.
-
The simulator outputs a message like:
Done simulation @ tick = 939334500: target called exit()
This indicates on which simulation tick the program completed. By default, each simulation tick represents 1 picosecond of simulated time, and the simulated CPU has a clock rate of 2 GHz, so this simulation represents around 939 million ticks / (500 ticks / clock cycle) ≈ 1.9M clock cycles of simulated time.
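As a sanity check on that arithmetic, the tick-to-cycle conversion can be written out directly (this is just the conversion described above: 1 ps ticks and a 2 GHz clock, nothing gem5-specific):

```python
# Convert a gem5 tick count to simulated cycles and seconds, assuming
# the defaults described above: 1 tick = 1 picosecond, 2 GHz clock.
TICKS_PER_SECOND = 10**12                        # 1 ps per tick
CLOCK_HZ = 2 * 10**9                             # 2 GHz simulated clock
TICKS_PER_CYCLE = TICKS_PER_SECOND // CLOCK_HZ   # = 500

final_tick = 939334500    # from "Done simulation @ tick = 939334500"
cycles = final_tick / TICKS_PER_CYCLE
print(cycles)             # ~1.88 million cycles, i.e. roughly 1.9M
```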
The more important outputs from the simulation are in the blocked-matmul-output directory specified via the --directory option. This will contain the following files:
- config.ini, config.json: contain the full configuration of the components of the simulation.
- program.err, program.out: contain the output of the program. From program.out, you can read the amount of simulated time that blocked-matmul took.
- stats.txt: contains numerous statistics from the simulation. Interesting statistics include:
  - sim_seconds: simulation time
  - system.cpu.ipc: instructions per cycle achieved by the simulated CPU
  - system.cpu.commit.loads, system.cpu.commit.branches, etc.: number of loads, branches, etc. finished
  - system.cpu.iew.exec_branches: number of branches executed (regardless of whether taken), including ones only executed because of a misprediction of a prior branch
  - system.cpu.iew.branchMispredicts: number of branch mispredictions
  - system.cpu.iq.fu_full::IntALU: number of times a reservation buffer for integer ALU operations was not available but could have been used
  - system.cpu.dcache.overall_miss_rate::cpu.data: data cache miss rate
gem5script.py
takes some parameters that affect the simulation. For example, you can
adjust the data cache size with the --l1d_size
option. Try running:
./gem5.opt gem5script.py \
--cmd=./blocked-matmul \
--directory=blocked-matmul-output-hugecache \
--l1d_size=2MB
[The \ represents a line continuation. This indicates that this is meant to all be one command, even though there are newlines in the middle of it.]
Notice that the value of system.cpu.dcache.overall_miss_rate::cpu.data in stats.txt from this simulation is much lower than the first simulation.
Note that the simulator takes on the order of a minute to simulate 5 milliseconds of simulated runtime, a slowdown of around 10000 times. For this reason, we will generally be only running very short benchmark programs in this assignment.
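If you find yourself comparing statistics across many runs, a short script can save time. This is only a sketch: it assumes the usual stats.txt layout of one statistic per line (name, value, then a # description), and the directory names are just the ones from the example above.

```python
import os

def read_stat(stats_path, name):
    """Return the value of the first statistic with this name in a gem5 stats.txt."""
    with open(stats_path) as f:
        for line in f:
            fields = line.split()
            # Lines look like: "system.cpu.ipc  0.534881  # IPC: ..."
            if len(fields) >= 2 and fields[0] == name:
                return float(fields[1])
    return None

# Example: compare the data cache miss rate of the two runs above
# (skipping output directories that do not exist yet).
for d in ['blocked-matmul-output', 'blocked-matmul-output-hugecache']:
    path = os.path.join(d, 'stats.txt')
    if os.path.exists(path):
        print(d, read_stat(path, 'system.cpu.dcache.overall_miss_rate::cpu.data'))
```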
Supplying Command-Line Arguments
To supply command-line arguments to a program you run under gem5, you can pass
the --options
option, for example:
./gem5.opt gem5script.py \
--cmd=./queens \
--directory=queens-default-output \
--options='-c 10'
runs the command queens -c 10
in the simulator.
This is handled by the code in the create_process()
function in gem5script.py.
When supplying filenames as arguments, you should generally provide the full path to the file.
The program will be run inside the output directory instead of in your current working
directory. This is why we suggest using realpath
in the benchmark instructions below.
Varying Simulation Parameters
Rather than vary cache parameters, you will primarily be varying parameters of the simulated
out-of-order CPU. Most of these parameters don’t have convenient command-line options. Instead,
you will need to modify the create_cpu() function in gem5script.py to change the parameters as
you choose. A comment before this function shows some examples of how to modify parameters,
including some which are more complicated to set.
(The original version of the benchmarks archive also has an error in the first
paragraph of this comment: it claims the function varies the number of reorder buffer entries
by default, as described below, but it does not as supplied.)
If you aren’t familiar with Python, note that text between triple quotes (""") in this file is effectively a comment, including several paragraphs of text before def create_cpu.
You should edit parameters by modifying gem5script.py
’s create_cpu
function,
around where there is a comment reading YOUR CUSTOMIZATION CODE HERE
.
The code you add can look like:
the_cpu.numROBEntries = 100
to run with 100 ROB (reorder buffer) entries instead of the default.
For a full list of parameters in the out-of-order CPU model, look at the
gem5 source file src/cpu/o3/O3CPU.py
.
You can also edit the get_options()
function to make the script support additional
command-line options which you can access via the ‘options’ parameter to the create_cpu()
function. This can allow you to avoid changing the python file every time you want to run
a different simulation.
For example, if you add code like:
parser.add_option('--vary', type=str, default='none')
you could do something like
if options.vary == 'rob100':
    the_cpu.numROBEntries = 100
elif options.vary == 'none':
    pass # python for do nothing
else:
    eprint("ERROR: unrecognized --vary option")
    sys.exit(1)
to let you run
./gem5.opt gem5script.py \
--cmd=./blocked-matmul \
--directory=blocked-matmul-output-rob100 \
--vary=rob100
to try 100 ROB entries, writing output to blocked-matmul-output-rob100
and:
./gem5.opt gem5script.py \
--cmd=./blocked-matmul \
--directory=blocked-matmul-output-default
to try with the default configuration.
Stages of the Simulated Processor
The simulated processor executes instructions in several stages, which are important to understand to reason about the statistics reported:
- fetch: instructions are fetched from the instruction cache. By default, the processor fetches up to eight instructions at a time; the number fetched is controlled by the fetchWidth parameter. This stage does branch prediction and branch target prediction to determine what to fetch.
- decode: instructions from the fetch stage are preprocessed. This stage handles execution of unconditional branches (whose target address is not in a register, etc.). In a real processor, this is where instruction register numbers, etc. would be identified (but the simulator does not work this way internally; it only simulates the timings). The maximum number of instructions processed per cycle is controlled by the decodeWidth parameter.
- rename: entries in the re-order buffer and the instruction queue (approximately a shared reservation buffer) are allocated for each instruction. Register operands of the instruction are renamed, updating a renaming map (blocking if not enough free registers are available). The maximum number of instructions processed per cycle is controlled by the renameWidth parameter.
- dispatch/issue: instructions whose renamed operands are available are dispatched to functional units. Loads and stores are dispatched to the Load/Store Queue (LSQ). The simulated processor has a single instruction queue from which all instructions issue; ordinarily, instructions are taken in order from this queue. The maximum number of instructions processed per cycle is controlled by the dispatchWidth parameter.
- execute: the functional units actually process their instructions. Each functional unit may have a different latency (number of cycles until it produces a result). Conditional branch mispredictions are identified here. The maximum number of instructions processed per cycle is determined by the configured types of operations, latencies, and counts of the functional units available.
- writeback: the result of each instruction is sent to the corresponding physical register (if any), marking the register as available and permitting the issue of dependent instructions, and the instruction's reorder buffer entry is updated. The maximum number of instructions processed per cycle is controlled by the wbWidth parameter.
- commit: the reorder buffer is processed in order, freeing up reorder buffer entries. A second renaming map is updated. The maximum number of microops processed per cycle is controlled by the commitWidth parameter.
In the event of branch misprediction, trap, or other speculative execution event, “squashing” can occur at all stages of this pipeline. When a pending instruction is squashed, it is removed from the instruction queues, reorder buffers, requests to the instruction cache, etc.
The fetch, decode, rename and commit stages process instructions in program order. Other stages process instructions out-of-order based on availability of operands and results.
The simulated processor also lets one configure the latency between many of these stages — how many clock cycles it takes an instruction to pass from one phase to another in the best case.
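These latencies can be set in create_cpu() just like the width parameters. The parameter names below are taken from the O3 CPU model's parameter file; this is an untested fragment, so check src/cpu/o3/O3CPU.py for the exact names and defaults in your copy of gem5:

```python
# Untested fragment for create_cpu() in gem5script.py: set inter-stage
# latencies (in cycles). Parameter names come from src/cpu/o3/O3CPU.py;
# verify them against that file before relying on this.
the_cpu.fetchToDecodeDelay = 2   # fetch -> decode
the_cpu.decodeToRenameDelay = 1  # decode -> rename
the_cpu.renameToIEWDelay = 2     # rename -> issue/execute/writeback
the_cpu.iewToCommitDelay = 1     # writeback -> commit
```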
To deal with the complicated instructions that are very common in X86, like a single instruction that performs a load and an add, the simulated processor splits many instructions into multiple ‘microoperations’. Confusingly, it is often not clear whether statistics refer to microops (while calling them instructions) or real instructions. Generally, statistics about the issue, execute, and writeback phases always concern microops (even if their descriptions use the word “instructions”), while statistics about the commit phase will make clear which is meant.
Some Terminology in the Program Statistics
- IEW: issue and execute and writeback. The portion of the simulated processor that reads the queue from rename and manages the load/store queue and the functional units.
- IQ: instruction queue, where instructions are placed when they are ready to be executed on functional units or the caches.
- LSQ, LQ, SQ: Load/Store Queue, Load Queue, Store Queue: queues that hold pending memory operations. These are also responsible for tracking accessed memory addresses to ensure that out-of-order load/store instructions that refer to the same memory address produce the correct results.
- non-speculative instruction: An instruction that cannot be executed speculatively. These instructions must be executed when their reorder buffer entry is committed and no earlier. An example of such an instruction is the system call instruction.
- squashing: undoing the effect of instructions due to a branch misprediction, fault, or other kind of incorrect speculative execution. Note that this occurs at all stages of the pipeline, rather than waiting for instructions to reach the reorder buffer.
Supplied Benchmark Programs
I have selected several benchmark programs that should explore a range of demands on the simulated processor. For each program, I have a suggested way to run the program that should take not much more than a minute to simulate each time. If the program takes command-line arguments, you should pass the arguments when using them in simulations as described above.
-
blocked-matmul: a 2x2 register-blocked matrix multiplication of two 84x84 matrices. The matrices are pseudorandomly generated and all sizes are hard-coded. Source code for this benchmark is in
blocked-matmul.c.
Our suggested command-line for this program is
./blocked-matmul
It takes no command-line arguments.
This program was selected because it should have a mix of cache accesses and floating point operations.
-
BFS: computes a breadth-first search problem. This is taken from the Problem Based Benchmark Suite. Source code for this benchmark, along with utilities for generating graph data, is in the
breadthFirstSearch directory.
We supply some example graphs in the inputs directory. Our suggested command-line for this program is
./BFS path/to/RL3k.graph
where path/to/RL3k.graph is the full path to the RL3k.graph file supplied in the inputs directory of the benchmarks archive and via this link. You may get this with a command like:
realpath inputs/RL3k.graph
This program was selected because it should have poor data cache locality.
-
sha: computes the SHA-1 cryptographic hash of its input. Source code for this benchmark is in the
sha-src directory.
Our suggested command-line for this program is
./sha path/to/example-sha-input.txt
where path/to/example-sha-input.txt is the full path to the example-sha-input.txt supplied in the inputs directory. You may get this with a command like:
realpath inputs/example-sha-input.txt
This program was selected because it should be integer-operation intensive and very friendly for branch prediction and cache.
-
queens: solves the N queens problem for an N specified as an argument. This is taken from the LLVM (a compiler toolkit) test suite, but based on comments in the source file queens.c, it is much older. Source code is in the
queens.c file.
Our suggested command-line for this program is
./queens -c 10
The -c option indicates to count solutions instead of printing them out.
This program was selected because it should be very friendly to the cache, but very challenging for branch prediction.
Tasks
Part A: General attributes of benchmark programs (required for checkpoint)
Test each program with the suggested command-line above and create a table reporting the following with the default settings for the simulated processor:
- Portion of microops run that are:
- integer instructions other than multiplies/divides
- integer multiplies or divides
- floating-point instructions
- branch and function call instructions
- memory/cache accessing instructions
- Mean number of microops generated per program instruction
- Portion of branches correctly predicted
- Average instructions executed per cycle
- Average microops executed per cycle
- Instruction cache hit rate
- Data cache hit rate
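One way to organize this table is to compute the derived rows from raw counters with a small script. This sketch uses hypothetical dictionary keys standing in for values you read out of stats.txt yourself; the statistic names in the comment are suggestions to verify against your own stats.txt, not guaranteed names.

```python
# Sketch of deriving some Part A rows from raw gem5 counters. The keys
# below are placeholders for values read from stats.txt (candidates to
# verify: system.cpu.commit.committedInsts, system.cpu.commit.committedOps,
# system.cpu.numCycles, system.cpu.iew.exec_branches,
# system.cpu.iew.branchMispredicts).

def part_a_rows(stats):
    rows = {}
    rows['microops per instruction'] = (
        float(stats['committed_ops']) / stats['committed_insts'])
    rows['instructions per cycle'] = (
        float(stats['committed_insts']) / stats['cycles'])
    rows['microops per cycle'] = (
        float(stats['committed_ops']) / stats['cycles'])
    rows['branch prediction accuracy'] = (
        1.0 - float(stats['branch_mispredicts']) / stats['branches_executed'])
    return rows

# Made-up example numbers, just to show the shape of the output:
example = {'committed_insts': 1000000, 'committed_ops': 1300000,
           'cycles': 800000, 'branches_executed': 120000,
           'branch_mispredicts': 6000}
for name, value in sorted(part_a_rows(example).items()):
    print('%-28s %.3f' % (name, value))
```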
Part B: Use the pipeline viewer (required for checkpoint)
gem5 includes a pipeline viewer, briefly described here. Try viewing the pipeline during the execution of the blocked-matmul program. First run blocked-matmul with debugging options to record pipeline activity from simulation tick 500000000 to 501000000:
./gem5.opt --debug-flags=O3PipeView --debug-start=500000000 --debug-file=trace.out \
gem5script.py --directory=matmul-trace -c ./blocked-matmul -m 501000000
The -m option specifies to terminate the simulation after running 501000000 ticks. After running this command, you will have the usual output in the matmul-trace directory, plus a file called trace.out in that directory. If you look at this file, you will see that it contains a list of instructions, along with the clock tick at which they completed each phase of their execution. If an instruction was speculatively executed and did not complete some phase of its execution, ‘0’ is listed instead.
You can pass this trace.out to gem5’s pipeline viewer to get an easier to read output:
PATH/TO/GEM5/util/o3-pipeview.py --cycle-time=500 --color ./trace.out -o pipeview.out
The --cycle-time option specifies the clock period of the simulated CPU in simulation ticks. This command produces an output file called pipeview.out, which you can view with a command like:
less -r pipeview.out
Note that the output is very wide by default. You can adjust the width of the output by supplying a value for the -w option:
PATH/TO/GEM5/util/o3-pipeview.py -w 40 --cycle-time=500 --color ./trace.out -o pipeview.out
For each instruction, this indicates what the address of the instruction was, the kind of instruction it was (using gem5’s internal name) and how long each phase of its completion took.
Some lines have instruction names prefixed with ‘-----’. These represent instructions
which were executed speculatively and ‘squashed’ (their effects cancelled), for example
because a branch was mispredicted.
- Include your pipeview.out in your submission.
Part C: Effective cache miss penalty versus miss latency (required for checkpoint)
For this section, use only the BFS program. Notice that it has a relatively high data cache miss rate. The default configuration of the processor not only overlaps cache misses with other computation but can have many outstanding cache misses.
The simulated cache has a number of MSHR (Miss Status Handling Registers)
that each can make up to one request at a time to the memory system. The
number of MSHRs is controlled by the mshrs
parameter of each cache object
(like the dcache
variable produced by the supplied code).
-
Adjust the number of MSHRs in the data cache to 1 (from the default of
84) for the BFS program. Compare the performance to a higher number of MSHRs, and compare the
system.cpu.dcache.blocked_cycles::no_mshrs
statistic from each run. What was the difference in performance?
-
Try adjusting the data cache size to decrease the number of cache misses to a negligible number. What was the difference in performance?
-
Examine the
system.cpu.dcache.overall_miss_latency
statistic, which shows the total number of simulation ticks that memory accesses triggered by cache misses took. Based on this (and any other statistics), estimate the performance gained by the ability to overlap cache misses with other instructions (both other cache accesses and non-cache instructions like integer arithmetic). Identify any major limitations of your estimate.
The overall_miss_latency
statistic counts time when there are two active cache misses twice. For example, if there is a miss at time 1 that takes until time 11 to resolve and a miss at time 5 that takes until time 14, the overall_miss_latency
will be (11-1) + (14-5) = 19.
When making your estimate, note that the simulated processor always overlaps cache misses with other operations, even when it is configured with one MSHR and so can only handle one outstanding cache miss at a time.
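As a worked illustration of the kind of estimate involved (all numbers here are invented placeholders, not from a real run): dividing overall_miss_latency by the miss count gives an average latency per miss, and the summed miss latency suggests what a fully serialized, non-overlapped execution might have paid.

```python
TICKS_PER_CYCLE = 500  # default: 1 ps ticks, 2 GHz clock

# Hypothetical numbers standing in for values from stats.txt:
overall_miss_latency_ticks = 4.0e9  # system.cpu.dcache.overall_miss_latency
num_misses = 80000.0                # total data cache misses
total_cycles = 5.0e6                # cycles the run actually took

miss_latency_cycles = overall_miss_latency_ticks / TICKS_PER_CYCLE
avg_latency_per_miss = miss_latency_cycles / num_misses

# If no miss overlapped with anything, the program might stall for the
# full summed latency; comparing that against the actual runtime bounds
# the benefit of overlap. Note the limitations you should identify in
# your writeup: overlapping misses are double-counted in this statistic,
# and some of each miss's latency is hidden behind useful work anyway.
serialized_estimate = miss_latency_cycles
print('avg latency per miss (cycles):', avg_latency_per_miss)
print('summed miss latency (cycles): ', serialized_estimate)
print('actual run length (cycles):   ', total_cycles)
```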
Part D: Branch Prediction Benefits
Examine the counts of `squashed’ instructions and memory requests ignored due to squashing.
-
Based on the counters, estimate what portion of throughput is lost due to branch misprediction for each benchmark program. Identify any major limitations of your estimate.
-
Based on the rate of branches, estimate the total benefit of correct branch prediction for each benchmark. Identify any major limitations of your estimate.
-
Change the branch predictor to the NeverTakenBP we have provided (by setting
the_cpu.branchPred = NeverTakenBP()
) and run each benchmark again. This branch predictor predicts every conditional branch as not taken, resulting in a very low rate of correct branch prediction. How does the performance difference compare to your estimates?
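For the first estimate, one crude starting point is the fraction of executed microops that were later squashed. This sketch uses invented placeholder numbers; the relevant squash counters in stats.txt have names along the lines of system.cpu.iew.iewExecSquashedInsts, which you should verify in your own output.

```python
# Fraction of executed microops wasted on squashed work -- a crude
# lower bound on throughput lost to misprediction. All numbers here
# are hypothetical placeholders for values from stats.txt.
executed_ops = 2.0e6   # microops executed, including squashed ones
squashed_ops = 3.0e5   # microops executed and then squashed
committed_ops = executed_ops - squashed_ops

wasted_fraction = squashed_ops / executed_ops
print('fraction of executed microops squashed: %.3f' % wasted_fraction)
# Note this ignores fetch/decode bandwidth spent on wrong-path work and
# cycles spent refilling the pipeline after a squash, so it understates
# the true cost -- a limitation worth mentioning in your answer.
```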
Part E: Achievable bandwidth
Examine the instruction mixes for each benchmark program based on committed microoperations
(the system.cpu.commit.op_class_0::
counters).
Note that program instructions may be split into multiple
microoperations, each of which is executed individually on a functional unit.
Examine the available functional units in the gem5 source code in src/cpu/o3/FuncUnitConfig.py.
In FuncUnitConfig.py
, the counts indicate the “width” of each type of functional unit
available. Unless pipelined=False is specified, each functional
unit can accept count new operations per cycle. Unless an opLat
option is specified, each functional unit takes 1 cycle to produce a result after an operation is
dispatched to it.
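The first question below reduces to a small bottleneck computation: for each functional-unit class, the fraction of microops needing that class divided into the class's total width bounds the achievable rate, and the tightest bound wins. A sketch with invented numbers (the op-class names follow gem5's, but the mix and widths here are placeholders, not the real configuration):

```python
# mix: fraction of committed microops per op class (from the
# system.cpu.commit.op_class_0:: counters). fu_width: total microops per
# cycle each class of functional unit can accept. Placeholder values.
mix = {'IntAlu': 0.60, 'MemRead': 0.25, 'MemWrite': 0.10, 'FloatAdd': 0.05}
fu_width = {'IntAlu': 6, 'MemRead': 2, 'MemWrite': 2, 'FloatAdd': 2}

# If R microops dispatch per cycle overall, class c consumes mix[c]*R of
# its width fu_width[c]. The largest sustainable R keeps every class
# within its width, i.e. R <= fu_width[c]/mix[c] for all c.
max_rate = min(fu_width[c] / mix[c] for c in mix)
print('bottleneck-limited microops/cycle: %.2f' % max_rate)
```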
-
Based on the instruction mix and the available functional units, if the other stages were not bottlenecks, what is the maximum average number of microoperations which could be dispatched per cycle? Assume that the mix of instructions is approximately constant throughout each benchmark’s execution.
-
How close is each program to the maximum possible average computation rate it could achieve given its instructions and the available functional units?
- Computation speed is also limited by the widths of the non-execute phases. Examine the counters
and histograms in the stats.txt files for:
- instructions fetched per cycle
- instructions decoded per cycle
- instructions renamed per cycle
- instructions dispatched per cycle
- instructions issued per cycle
- instructions committed per cycle
- instructions squashed per cycle (when undoing a mispredicted branch in the commit phase)
Note that, by default, the maximum number of instructions (usually actually microops) per cycle for each of these phases is 8.
Consider changing all these widths to 4. Which benchmark seems like it should be most affected and why? Which benchmark seems like it should be least affected and why?
-
Now try running each benchmark with all these widths changed to 4. How does performance actually change?
-
A fundamental limit on out-of-order processors is the “dependence limit” — the inability to issue instructions faster than their dependencies can be computed. The processor in question tries to evade the dependence limit somewhat by speculatively executing past unexecuted branches, and speculatively executing memory accesses before it knows if their addresses will conflict, but there is a limit on how far it can go.
Try to run the simulated processor much closer to this dependence limit by increasing all of the following:
- all the widths from Part E.3
- the number of functional units of each type
- the number of reorder buffer entries
- the number of physical registers of every type, and
- the size of the load/store queues.
How much does the instructions per cycle increase by?
Part F: Miscellaneous Resource
Choose two of following parameters:
a. the latency of the available functional units;
b. the size of the reorder buffer;
c. the size of the instruction queue;
d. the sizes of the load and store queues;
e. the number of physical registers (internal register names);
f. the size of the branch predictors
-
For each parameter you choose, look for evidence in benchmark statistics from benchmarks you already ran that changing it is likely to have a small or large effect on performance. What do you find? Do you think it is possible to infer the effects of the parameter from the statistics? Explain why or why not.
-
Try varying the parameter and running the benchmark programs. How much does performance change? Does this match your expectations?
Note on errors
If you get an error about gettid being unimplemented, check the program.err file. Often this means the program failed and tried to execute an error handling routine that needed to know some process information that gem5 does not simulate properly.
Building benchmarks and gem5
You should not need the instructions in this section if you are using our prebuilt archives on a 64-bit Linux system.
Rebuilding the benchmarks
The benchmarks archive includes a Makefile.
If you have GNU make
and gcc and g++ installed, running make clean
, then make
should rebuild all
the benchmark programs for your system.
If you are using one of our supplied VMs, you may need to install these packages. One way
to do this is sudo apt-get install build-essential
.
If you’d like to build the benchmarks with a different compiler and/or different compiler
options, you will need to edit the Makefile
and the Makefile in
breadthFirstSearch/deterministicBFS
and the Makefile in sha
.
Rebuilding gem5
These instructions are for building gem5 on a Unix-like system.
To build gem5 from source, first make sure the following dependencies (listed with Ubuntu/Debian package names) are installed (from the video on this page):
- mercurial
- scons
- swig
- gcc
- g++
- m4
- python-dev (development packages for python 2)
- libgoogle-perftools-dev (development packages for google-perftools)
Then acquire the gem5 source. You can take our prebuilt package and remove
the build
directory. Alternatively, you can check out gem5 from mercurial yourself using
hg clone http://repo.gem5.org/gem5
and then apply our patch using
patch -p 1 < ./gem5.patch
Then use scons build/X86/gem5.opt
to actually build gem5.
Linux quick start
From a default graphical install of Ubuntu, log in and use alt-F2 and type
gnome-terminal
to get a command line window. (Alternately, if using vagrant,
vagrant ssh
will give you a command-line window in the virtual machine.)
Then, here are some useful commands (adapted from our advice in CS 3330):
- pwd: identify the current directory you are in
- ls: list files in the current directory
- mkdir: create a new directory
- man ls: display a manual page for ls. This also works for essentially every other command.
- cd dirname: enter the dirname directory
- cd ..: enter the parent directory of the current directory; for example, if you are in /home/cr4bd/benchmarks, the parent directory is /home/cr4bd. You can also generally use .. to refer to files in the parent directory; for example, ../foo.txt names a file called foo.txt in the parent directory.
- ./foo: runs an executable called foo in the current directory.
- history: show commands you ran recently.
- wget URL: download the file from URL to the current directory.
- tar -zxvf FILE.tar.gz: extract the tar archives we provide.
- gedit &: open a graphical text editor. Other possible graphical editors include geany and kate. The & indicates to run the command in the background.
- nano or pico: open an editor that operates in the terminal (likely necessary if you are SSHing into your VM).
- sudo apt install PACKAGE: install a software package. If you’re using Ubuntu 16.04, you can search possible packages on this page, selecting the xenial distribution. The sudo indicates to run the specified command as root. It will require your password. In the VM image we supply, the password is password.
In the terminal, you can use up-arrow and down-arrow to repeat recent commands. If you press TAB, this will
try to complete your current command, for example cd be
TAB will probably result in cd benchmarks
if a directory
named benchmarks
exists in the current directory. If there were multiple directories starting with be
, then
it would not complete to any name, but pressing TAB twice would give you a list of the possible completions.