Changelog:
- 12 Oct 2023: note about module load valgrind being required on department machines
- 13 Oct 2023: update skeleton code archive's Makefile default/all target (not mentioned in instructions here) to work
- 16 Oct 2023: write some general hints for all problems at the end
1 Your Task
Compatibility note: To ensure consistent results, please test your solution on the department machines (via NX or SSH). Though the assignment is constructed so as to minimize the effect of differences between compiler versions, differences between the instructions generated by different compilers can cause different simulated miss rates on different platforms. We intend to accept any solution that worked for you when grading (including by testing with several compilers and compiler options), but it would be simpler/more certain for you to verify your solution on the department machines.
Download the skeleton code we provide in this archive.
Modify or replace the supplied programs prog1.c, prog2.c, and prog3.c so that, in the simulation under valgrind cachegrind, they achieve the miss rates described below. (The skeleton code we supply has a structure that should be useful and comments suggesting particular modifications that might be useful, but you can perform whatever modifications you think would be helpful, including completely replacing the programs.)
Note: in the sizes below, 1KB = 1024 bytes, so all cache sizes are a power-of-two number of bytes.
For prog1.c: Write a program that (when compiled with the Makefile) performs at least 10 million cache references and achieves a <10% miss rate with a 32KB (2^15 bytes), 2-way data cache with 64B blocks but a >90% miss rate with a 16KB (2^14 bytes), 2-way data cache with 64B blocks.
Hint: a program that would perform best with between 16KB and 32KB of data cached would be likely to achieve this.
Hint: try thinking about how to solve this with an 8B direct-mapped cache with 8B blocks versus a 16B direct-mapped cache with 8B blocks instead, and then work on generalizing your solution to larger caches.
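As a rough sketch of one possible shape for such a program (an assumption-laden example, not the only approach: it assumes global_array holds 4-byte ints, so 16 consecutive elements span one 64B block), you could initialize the chain, whether via the template's settings or by writing it directly, so that it visits one element per 64B block across a 24KB region and then wraps around:
/* hypothetical initialization: 6144 ints * 4 bytes = 24KB = 384 blocks of 64B */
for (int i = 0; i + 16 < 6144; i += 16)
    global_array[i] = i + 16;     /* next element is one 64B block later */
global_array[6128] = 0;           /* last visited element points back to the start */
The template's j-chasing loop then sweeps these 384 blocks over and over; the idea is that a 24KB working set should stay resident in the 32KB, 2-way cache but keep evicting itself from the 16KB, 2-way cache.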
For prog2.c: Write a program that performs at least 10 million cache references and achieves a <10% miss rate with a 32KB (2^15 bytes), 4-way data cache with 64B blocks but a >90% miss rate with a 32KB, 2-way data cache with 64B blocks. Submit your program as prog2.c.
Hint: Try to store 3 things in a single set in the two-way cache and in the four-way cache.
Hint: The 4-way cache has 6 cache offset bits and 7 cache index bits. The 2-way cache has 6 cache offset bits and 8 cache index bits. This means that if we have several addresses which are the same in bits 6 through 14, they will all have the same set index in each cache.
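To make those bit positions concrete, here is a small stand-alone snippet (not part of the skeleton; the base address is invented for illustration) that computes the set index a few addresses 32KB apart would map to in each configuration:
#include <stdio.h>
#include <stdint.h>

/* set index = (address >> offset_bits) & (2^index_bits - 1) */
static unsigned set_index(uintptr_t addr, int offset_bits, int index_bits) {
    return (unsigned)((addr >> offset_bits) & ((1u << index_bits) - 1));
}

int main(void) {
    uintptr_t base = 0x10000;                           /* made-up base address */
    for (int k = 0; k < 3; ++k) {
        uintptr_t addr = base + (uintptr_t)k * 0x8000;  /* addresses 32KB apart */
        printf("address %#lx: 4-way set %u, 2-way set %u\n",
               (unsigned long)addr,
               set_index(addr, 6, 7),   /* 32KB, 4-way, 64B blocks */
               set_index(addr, 6, 8));  /* 32KB, 2-way, 64B blocks */
    }
    return 0;
}
Because the three addresses agree in bits 6 through 14, the snippet prints the same set number for all of them in both configurations; three such blocks can all reside in one set of the 4-way cache at once, but not in one set of the 2-way cache.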
For prog3.c: Write a program that performs at least 10 million cache references and achieves a <60% miss rate with a 32KB (2^15 bytes), 2-way data cache with 128B blocks but a >90% miss rate with a 32KB, 2-way data cache with 64B blocks. Submit your program as prog3.c.
Hint: take advantage of spatial locality.
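As a very rough sketch of how spatial locality could come into play here (this example makes extra assumptions: global_array holds 4-byte ints and starts at or near a 128B-aligned address, which you would need to check or arrange), you could chain together pairs of elements 64 bytes apart inside the same 128B block, with successive pairs 16KB apart so that their blocks land in the same set:
/* hypothetical chain: 16 ints = 64 bytes, 4096 ints = 16KB */
global_array[0]    = 16;     /* second 64B half of the same 128B block */
global_array[16]   = 4096;   /* 16KB away: same set index, different block */
global_array[4096] = 4112;
global_array[4112] = 8192;
global_array[8192] = 8208;
global_array[8208] = 0;      /* wrap back to the start */
With 128B blocks, the second access of each pair touches a block the previous access just loaded; with 64B blocks, the two halves are separate blocks, and the three blocks that are 16KB apart contend for the two ways of a single set.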
Test your programs using the instructions below.
Use
make submit
to create a .tar file, and upload that file to the submission site.
2 Valgrind cachegrind
valgrind, which is installed on the department machines, includes a cache simulator that can simulate a variety of data cache configurations and report the results.
On the department machines, you will need to run
module load valgrind
to use the valgrind command.
For example, running something like:
valgrind --D1=8192,2,64 --I1=8192,2,64 --tool=cachegrind ls
runs ls, simulating an 8KB, 2-way data cache with 64-byte blocks and an 8KB, 2-way instruction cache with 64-byte blocks, and reports the results of the simulation in the output, which might look like:
==518718== Cachegrind, a cache and branch-prediction profiler
==518718== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==518718== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==518718== Command: ls
==518718==
--518718-- warning: L3 cache found, using its data for the LL simulation.
check_miss_rate.py part1a part1c prog1.c prog3 prog4.c prog5.c
Makefile part1b part1d prog2.c prog3.c prog5 prog-template.c
==518718==
==518718== I refs: 515,059
==518718== I1 misses: 1,961
==518718== LLi misses: 1,825
==518718== I1 miss rate: 0.38%
==518718== LLi miss rate: 0.35%
==518718==
==518718== D refs: 174,210 (127,904 rd + 46,306 wr)
==518718== D1 misses: 5,769 ( 4,585 rd + 1,184 wr)
==518718== LLd misses: 4,125 ( 3,049 rd + 1,076 wr)
==518718== D1 miss rate: 3.3% ( 3.6% + 2.6% )
==518718== LLd miss rate: 2.4% ( 2.4% + 2.3% )
==518718==
==518718== LL refs: 7,730 ( 6,546 rd + 1,184 wr)
==518718== LL misses: 5,950 ( 4,874 rd + 1,076 wr)
==518718== LL miss rate: 0.9% ( 0.8% + 2.3% )
Lines starting with ==NUMBER== are the output of valgrind's cachegrind tool; the number indicates the process ID (in case multiple programs are run in parallel). Cachegrind simulates a two-level cache hierarchy (a first-level instruction cache and a first-level data cache, which share a single last-level cache). The fields of the cachegrind output are:
I refs: instruction cache references (accesses)
I1 misses: first-level instruction cache misses
LLi misses: last-level cache misses for instructions
I1 miss rate: first-level instruction cache miss rate
LLi miss rate: last-level cache miss rate, only counting instruction accesses
D refs: data cache references
D1 misses: first-level data cache misses; also split up into reads (rd) and writes (wr)
LLd misses: last-level cache misses for data; also split up into reads (rd) and writes (wr)
D1 miss rate: first-level data cache miss rate; also split up into read miss rate and write miss rate
LLd miss rate: last-level cache miss rate, only counting data accesses
LL refs: last-level cache references (accesses that missed in a first-level cache)
LL misses: last-level cache misses
LL miss rate: last-level cache miss rate
The fields we will be concerned about are the ones for the first-level data cache (D refs, D1 misses, D1 miss rate).
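For example, in the output above the D1 miss rate of 3.3% is just D1 misses divided by D refs: 5,769 / 174,210 ≈ 3.3%. This D1 miss rate is the number that needs to land below or above the thresholds given in the tasks.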
3 Supplied files
The tasks above ask you to write programs to achieve certain cache miss rates in cachegrind's simulation.
To do this we supply:
Makefile — scripts for compiling and running your programs and for creating an archive for submission
prog1.c, prog2.c, prog3.c — templates for the programs you need to produce for the tasks above. As supplied, the template program will achieve a high cache miss rate.
3.1 Template programs
The supplied template program sets up a global array and then accesses that global array in a loop:
for (int i = 0; i < ITERS; ++i) {
j = global_array[j];
}
When accessing the array, the program starts at global_array[0], then reads an element of the array to determine what element to access next (the index stored in j). So, for example, if global_array[0] = 4 and global_array[4] = 8 and global_array[8] = 0, then the program would access global_array[0], then global_array[4], then global_array[8], then global_array[0], then global_array[4], and so on until ITERS accesses are made.
(You can think of the array as a big cyclic singly-linked list
starting at global_array[0]; each array element is a list node
containing just a next pointer.)
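For instance, the specific cycle described above could be set up (before the access loop) with:
global_array[0] = 4;   /* after index 0, visit index 4 */
global_array[4] = 8;   /* then index 8 */
global_array[8] = 0;   /* then back to index 0 */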
The way in which it accesses this global array can be changed by changing how the array is initialized; our supplied code performs a regular pattern explained in a comment in the template file. You can adjust this pattern by changing some settings in the program. If those settings are insufficient for the type of access pattern you want, you could also write new code to initialize the array.
If you would rather write your own code entirely to achieve these cache miss rates, you are welcome to do so, but I believe the supplied template code would make this task easier.
4 Compiling and Testing
For any of these commands to work on the department machines,
you will need to first run
module load valgrind
.
4.1 Testing individual programs
You can compile your programs using a command like:
make prog1
Please use this instead of compiling without the Makefile to ensure your compiler options are consistent with ours.
Also, the Makefile provides testing commands like:
make run-prog1
which will compile (if needed) and run a program under valgrind cachegrind with the cache settings in the tasks above.
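(For reference, this is similar to running the valgrind command shown earlier by hand; for prog1, for example, the two configurations correspond roughly to
valgrind --D1=32768,2,64 --tool=cachegrind ./prog1
valgrind --D1=16384,2,64 --tool=cachegrind ./prog1
though the exact options the Makefile passes may differ.)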
4.2 Testing everything
We have a standalone test:
make test-all
which recompiles and runs all the programs and parses the output to determine whether you’ve achieved the miss rate targets. To run this test on department machines, you will need to run
module load python
first.
5 General hints
For all the programs, it can be helpful to work out what happens with very small caches that differ in the same ways. For example, for program 1, we suggest doing this for an 8B cache versus a 16B cache. You can come up with similar examples for each of the other programs.
When doing this, a general question to ask is: what can I access to make the two caches store different things? Initially, the answer may be nothing: you need to fill up part of the caches before they can store different things. Once the caches store different things, though, you can use this to determine what you should access next. For example, for program 1, if you determine that the smaller cache has {array[0], array[1], array[2], array[3]} and the larger one has {array[0], array[1], array[2], array[3], array[4], array[5]}, then accessing either array[4] or array[5] would be a way to have a miss in the smaller cache and a hit in the larger one.
If you decide you'd like to do an access pattern like global_array[0], global_array[53], global_array[195], global_array[0], global_array[53], global_array[195], etc., then you might find that this is hard to do by setting MAX and SKIP. You could configure a pattern like this manually by replacing the loop that uses MAX and SKIP with something like:
global_array[0] = 53;
global_array[53] = 195;
global_array[195] = 0;
Then the second loop (the one that uses j) will follow the 0, 53, 195, 0, 53, 195, etc. pattern.