Checkpoint: due 12 September 2016 11:59PM
Due: due 23 September 2016 11:59PM 26 September 2016 11:59 AM
The design of a memory hierarchy has a very large effect on application performance. In tuning a memory hierarchy, architects choose a large number of parameters, including the size of each level of cache, the layout (associativity) of each level of cache, prefetching policies, how caches are distributed between cores, etc. If these decisions have a substantial effect on application performance, they ought to be observable from applications. Your assignment is to produce microbenchmarks that will reveal many of these parameters.
Choose a system you have access to. Construct microbenchmarks from whose results one should be able to infer the value (time, size, or whether a feature exists) of each of the following:
For checkpoint:
For final submission:
and at least two of:
You are encouraged to observe performance counters, such as numbers of last-level cache misses, that may directly reflect cache accesses and other memory system activity you are trying to benchmark. If practical, however, your benchmark should reveal the parameter based on performance alone.
Note that, because of features of modern processors, you may be unable to obtain reasonable results for some measurements even after substantial effort that other platforms would not require. In that case, I do not want to punish you for choosing a more interesting microarchitecture; instead, please document what you tried and what the negative results were.
You may read the value of the parameters manually from your benchmark results (for example, based on the shape of the graph) instead of automating this process (but it would be awesome if you did automate it).
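If you do try to automate this, one simple heuristic (an assumption here, not a required method) is to look for the largest relative jump in measured latency between consecutive working-set sizes. The sketch below assumes the measurements are already in an array ordered by increasing size; the function name largest_jump is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Given latencies measured at increasing working-set sizes, return the
 * index i (1 <= i < n) where latency[i] / latency[i - 1] is largest.
 * The size at index i - 1 is then a rough estimate of the last size
 * that still fit in the faster level. */
static size_t largest_jump(const double *latency, size_t n) {
    assert(latency != NULL && n >= 2);
    size_t best = 1;
    double best_ratio = 0.0;
    for (size_t i = 1; i < n; i++) {
        assert(latency[i - 1] > 0.0);
        double ratio = latency[i] / latency[i - 1];
        if (ratio > best_ratio) {
            best_ratio = ratio;
            best = i;
        }
    }
    return best;
}
```

This finds only the single sharpest step. A hierarchy with several levels produces several steps, so a fuller version might report every index whose ratio exceeds some threshold, and noisy data may need smoothing first.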
A single zip or tar archive containing:
A document in PDF, HTML, or text format:
To read the time stamp counter (TSC), which is helpful for timing, from C with GCC or Clang on 64-bit x86, you can use:
#include <stdint.h> /* for uint64_t */
/* ... */
uint64_t read_tsc(void) {
    uint64_t hi, lo;
    /*
     * Embed the assembly instruction 'rdtsc', which should not be relocated (`volatile').
     * The instruction will modify rax and rdx, which the compiler should map to
     * lo and hi, respectively.
     *
     * The format for GCC-style inline assembly generally is:
     * __asm__ ( ASSEMBLY CODE : OUTPUTS : INPUTS : OTHER THINGS CHANGED )
     *
     * Note that if you do not (correctly) specify the side effects of an assembly operation, the
     * compiler may assume that other registers and memory are not affected. This can easily
     * lead to cases where your program will produce difficult-to-debug wrong answers when
     * optimizations are enabled.
     */
    __asm__ volatile ( "rdtsc" : "=a"(lo), "=d"(hi));
    return lo | (hi << 32);
}
Because of prefetching, cache misses from access patterns that try to access memory in order may be completely hidden. Consider other access patterns.
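One such pattern is a pointer chase through a random cycle: each load depends on the previous one, and the order is hard for a prefetcher to predict. The sketch below is one way to build and time such a chase, assuming 64-bit x86 with GCC or Clang. The names build_random_cycle and chase_ticks_per_load are made up for illustration, and the cycle is built with Sattolo's algorithm, which is a choice made here rather than a requirement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Read the time stamp counter (64-bit x86, GCC or Clang). */
static uint64_t read_tsc(void) {
    uint64_t hi, lo;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return lo | (hi << 32);
}

/* Fill next[0..n-1] so that repeatedly following next[] from slot 0
 * visits every slot exactly once before returning to 0.
 * Sattolo's algorithm: swapping slot i only with a strictly lower slot
 * always yields a single cycle. rand() % i is slightly biased and
 * limited by RAND_MAX, which is acceptable for a sketch. */
static void build_random_cycle(size_t *next, size_t n) {
    assert(next != NULL && n > 0);
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* 0 <= j < i */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Perform `steps` dependent loads and return average TSC ticks per load. */
static double chase_ticks_per_load(const size_t *next, size_t steps) {
    assert(next != NULL && steps > 0);
    volatile size_t sink;
    size_t pos = 0;
    uint64_t start = read_tsc();
    for (size_t s = 0; s < steps; s++)
        pos = next[pos];                 /* each load depends on the last */
    uint64_t end = read_tsc();
    sink = pos;                          /* keep the loop from being removed */
    (void)sink;
    return (double)(end - start) / (double)steps;
}
```

Running chase_ticks_per_load over arrays of increasing size should show the average cost per load rising as the working set outgrows each cache level. In this sketch several size_t entries share a cache line, so a more careful version might pad each entry to the line size of the machine under test.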
man 2 madvise in Linux: lets you advise the kernel either to avoid automatically using huge pages or to consider automatically using huge pages.
Reverse Engineering Intel Last-Level Cache Complex Addressing Using Performance Counters, 18th Int’l Symposium on Research in Attacks, Intrusions, and Defenses (RAID’15).
Are AES x86 Cache Timing Attacks Still Feasible?
man 5 proc in Linux: includes documentation for the special file /proc/PID/pagemap, which will provide a process’s virtual to physical page mapping.
man 2 mbind in Linux: on NUMA systems, lets you specify which NUMA region memory is allocated in.
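As a rough illustration of the madvise hint, the sketch below maps anonymous memory and passes an advice value such as MADV_HUGEPAGE or MADV_NOHUGEPAGE. The helper name map_with_advice is made up for illustration. Whether the kernel honors the advice depends on its transparent huge page configuration, so the madvise result is reported separately from the mapping itself.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Map `bytes` of zero-filled anonymous memory and pass `advice`
 * (for example MADV_HUGEPAGE or MADV_NOHUGEPAGE) to madvise.
 * Returns the mapping, or NULL if mmap fails. *advice_ok is set to 1
 * if madvise succeeded and 0 if the kernel rejected the advice. */
static void *map_with_advice(size_t bytes, int advice, int *advice_ok) {
    assert(bytes > 0 && advice_ok != NULL);
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    *advice_ok = (madvise(p, bytes, advice) == 0);
    return p;
}
```

Running the same benchmark on a buffer mapped with MADV_NOHUGEPAGE and on one mapped with MADV_HUGEPAGE is one way to separate TLB effects from cache effects. Release the mapping with munmap when done.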