Your Task

Compatability note: To ensure consistent results, please test your solution on the department machines (via NX or SSH) Though the assignment is constructed so to minimize the effect of differences between compiler versions, differences between the instructions generated by can cause different simulated miss rates on different platforms. We intend to accept any solution that worked for you when grading (including by testing with several compilers and compiler options), but it would be simpler/more certain for you to verify your solution on the department machines.

  1. Download the skeleton code we provide in this archive.

Lab portion

  1. For the lab, modify or replace the supplied programs prog1.c so that, in the simulation under valgrind cachegrind:

    (The skeleton code we supply has a structure that should be useful and comments suggesting particular modifications that might be useful, but you can perform whatever modifications you think would be helpful, including completely replacing the programs.)

    1. For prog1.c: Write a program that (when compiled with the Makefile) performs at least 10 million cache references and achieves a <10% miss rate with 32KB, 2-way data cache with 64B blocks but a >90% miss rate with a 16KB, 2-way data cache with 64B blocks.

      Hint: a program that would perform best with between 16KB and 32KB of data cached would be likely to achieve this.

      Hint: try thinking about how to solve this with a 8B direct-mapped cache with 8B blocks versus a 16B direct-mapped cache with 8B blocks instead, and then work on generalizing your solution to larger caches.

  2. Test your programs using the instructions below.

  3. Upload your prog1.c to the submission site. If you collaborated with other students, please add a comment at the top explaining who you worked with.

Homework portion

  1. Modify or replace the supplied programs prog2.c and prog3.c so that, in the simulation under valgrind cachegrind:

    1. For prog2.c: Write a program that performs at least 10 million cache references and achieves a <10% miss rate with 32KB, 4-way data cache with 64B blocks but a >90% miss rate with a 32KB, 2-way data cache with 64B blocks. Submit your program as prog2.c

      Hint: you want to create conflict misses by accessing values that are spaced apart in a particular manner. Try working out what you would need to do with a very small cache first.

    2. For prog3.c: Write a program that performs at least 10 million cache references and achieves a <60% miss rate with 32KB, 2-way data cache with 128B blocks but a >90% miss rate with a 32KB, 2-way data cache with 64B blocks. Submit your program as prog3.c

      Hint: take advantage of spatial locality

  2. Choose any one of the programs prog4.c and prog5.c to modify or replace so that:

    1. For prog4.c: Write a program that performs at least 10 million cache references and achieves a <10% miss rate with 32KB, 2-way data cache with 64B blocks but a >90% miss rate with a 32KB, 2-way data cache with 128B blocks. Submit your program as prog4.c

      Hint: take advantage of the smaller block size allowing the cache to store more blocks. In my solution, I used a less regular access pattern than the previous parts (rather than just changing MAX, ITERS, SKIP, I wrote new code to initialize global_array.)

    2. For prog5.c: Write a program that performs at least 10 million cache references and achieves a <40% miss rate with 32KB, 2-way data cache with 128B blocks but a >50% miss rate with a 32KB, 2-way data cache with 64B blocks. Submit your program as prog5.c

  3. Test your programs using the instructions below.

  4. Use make submit-hw to create a .tar file, and upload that file to the submission site.

Valgrind cachegrind

valgrind, which is installed on the department machines, includes a cache simulator that can simulate a variety of data cache configurations and report the results.

For example, running something like:

      valgrind --D1=8192,2,64 --I1=8192,2,64 --tool=cachegrind ls

    

runs ls simulating a 8KB, 2-way data cache with 64-byte blocks and a 8KB, 2-way instruction cache with 64-byte blocks and reports the results of the simulation in the output, which might look like:

      ==518718== Cachegrind, a cache and branch-prediction profiler
==518718== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==518718== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==518718== Command: ls
==518718== 
--518718-- warning: L3 cache found, using its data for the LL simulation.
check_miss_rate.py  part1a  part1c  prog1.c  prog3    prog4.c  prog5.c
Makefile            part1b  part1d  prog2.c  prog3.c  prog5    prog-template.c
==518718== 
==518718== I   refs:      515,059
==518718== I1  misses:      1,961
==518718== LLi misses:      1,825
==518718== I1  miss rate:    0.38%
==518718== LLi miss rate:    0.35%
==518718== 
==518718== D   refs:      174,210  (127,904 rd   + 46,306 wr)
==518718== D1  misses:      5,769  (  4,585 rd   +  1,184 wr)
==518718== LLd misses:      4,125  (  3,049 rd   +  1,076 wr)
==518718== D1  miss rate:     3.3% (    3.6%     +    2.6%  )
==518718== LLd miss rate:     2.4% (    2.4%     +    2.3%  )
==518718== 
==518718== LL refs:         7,730  (  6,546 rd   +  1,184 wr)
==518718== LL misses:       5,950  (  4,874 rd   +  1,076 wr)
==518718== LL miss rate:      0.9% (    0.8%     +    2.3%  )

    

Lines starting with ==NUMBER== are the output of valgrind’s cachegrind tool, the number indicates the process ID (in case multiple programs are run in parallel). Cachegrind simulates a two-level cache hierarchy (a first-level instruction cache and first-level data cache which used a single last-level cache). The fields of the cachegrind output are:

The fields fields we will be concerned about are the ones for the first-level data cache (D refs, D1 misses, D1 miss rate`).

Supplied files

Below, we will ask you to write programs to achieve certain cache miss rates in cachegrind’s simulation.

To do this we supply:

Template programs

The supplied template program which sets up a global array, and then accesses that global array in a loop. When accessing the array, the program reads an element of array to determine what element to access next. (You can think of the array as a big cyclic linked list.)

The way in which it accesses this global array can be changed by changing how the array is initialized; our supplied code performs a regular pattern explained in a comment in the template file. You can adjust this pattern in by changing some settings in the program. If those settings are insufficient for the type of access pattern you want, you could also write new code to initialize the array.

If you would rather write your own code entirely to achieve these cache miss rates, you are welcome to do so, but I believe the supplied template code would make this task easier.

Compiling and Testing

Testing individual programs

You can compile your programs using a command like:

      make prog1

    

Please use this instead of compiling without the Makefile to ensure your compiler options are consistent with ours.

Also, the Makefile provides testing commands like:

      make run-prog1

    

which will compile (if needed) and run a program under valgrind cachegrind with the cache settings in the tasks above.

Testing everything (homework only)

We have a standalone test:

      make test-all

    

which recompiles and runs all the programs and parses the output to determine whether you’ve achieved the miss rate targets. To run this test on department machines, you will need to run

      module load python

    

first.