CS 6354: Graduate Computer Architecture

1 Homework 1

Checkpoint: due 12 September 2016 11:59PM

Due: ~~23 September 2016 11:59PM~~ 26 September 2016 11:59 AM

The design of memory hierarchies has a very large effect on application performance. In tuning a memory hierarchy, architects have a large number of parameters, including the sizes of each level of cache, the layout (associativity) of each level of cache, prefetching policies, how caches are distributed between cores, etc. If these decisions have a substantial effect on application performance, they ought to be observable from applications. Your assignment is to produce microbenchmarks that will reveal many of these parameters.

Choose a system you have access to. Construct microbenchmarks from whose results one can infer the value (a time, a size, or whether a feature exists) of each of the following:

For checkpoint:

For each data or unified (data and instruction) cache:
- The size of that cache
- The size of blocks (AKA lines) in that cache
For each data or unified TLB:
- The size (number of entries) of that TLB
The single-core sequential throughput (read and write) of main memory
The single-core random throughput (read and write) of main memory
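As one illustration of the throughput items above, here is a minimal sketch of a sequential read measurement. The helper names (`now_seconds`, `sequential_read_throughput`) and the choice of the C11 `timespec_get` clock are assumptions made for this example, not part of the assignment, and the empty inline-assembly barrier assumes GCC or Clang. A real benchmark would also need a write variant, a buffer much larger than the last-level cache, and repeated trials.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Wall-clock seconds from C11 timespec_get(); a monotonic clock such as
 * POSIX clock_gettime(CLOCK_MONOTONIC) may be preferable where available. */
static double now_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/*
 * Hypothetical helper: read a buffer of `bytes` bytes sequentially `passes`
 * times and return the observed read throughput in bytes per second
 * (0.0 if allocation fails or no time elapsed).
 */
double sequential_read_throughput(size_t bytes, int passes) {
    assert(bytes >= sizeof(uint64_t) && passes > 0);
    size_t n = bytes / sizeof(uint64_t);
    uint64_t *buf = malloc(n * sizeof(uint64_t));
    if (buf == NULL)
        return 0.0;
    for (size_t i = 0; i < n; i++)   /* touch every page before timing */
        buf[i] = i;

    double start = now_seconds();
    for (int p = 0; p < passes; p++) {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += buf[i];
        /* Empty GCC/Clang asm: marks sum as used and memory as possibly
         * changed, so each pass really reads the buffer again. */
        __asm__ volatile ("" : : "r"(sum) : "memory");
    }
    double elapsed = now_seconds() - start;

    free(buf);
    if (elapsed <= 0.0)
        return 0.0;
    return (double)n * sizeof(uint64_t) * passes / elapsed;
}
```

For example, `sequential_read_throughput((size_t)1 << 28, 4)` reads a 256 MiB buffer four times; dividing the result by 1e9 gives GB/s.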

For final submission:

For each data or unified (data and instruction) cache:
- The associativity of that cache
- The single-core throughput of that cache
- The latency of a cache access
For each instruction cache:
- The size of that cache
For each data or unified TLB:
- The associativity of that TLB
The maximum latency of a main memory access
Whether hardware prefetching is supported
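One common way to probe associativity is to access a growing number of addresses that all map to the same cache set and watch for the point where access time jumps. The sketch below only computes such addresses. It assumes simple modulo set indexing, which does not hold for every cache (some last-level caches hash addresses), and it ignores the difference between virtual and physical addresses. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Set index under simple modulo indexing (an assumption, see above). */
size_t set_index(uintptr_t addr, size_t line_bytes, size_t num_sets) {
    assert(line_bytes > 0 && num_sets > 0);
    return (size_t)((addr / line_bytes) % num_sets);
}

/*
 * Fill out[0..k-1] with k addresses, starting at base, that share one set:
 * consecutive addresses are line_bytes * num_sets bytes apart.
 */
void conflicting_addresses(uintptr_t base, size_t line_bytes,
                           size_t num_sets, size_t k, uintptr_t *out) {
    for (size_t i = 0; i < k; i++)
        out[i] = base + (uintptr_t)(i * line_bytes * num_sets);
}
```

As an illustrative case, a 32 KiB, 8-way cache with 64-byte lines has 32768 / (8 * 64) = 64 sets, so conflicting addresses are 64 * 64 = 4096 bytes apart. Cycling through 8 of them should fit in one set, while 9 should not.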

and at least two of:

If the system is NUMA (non-uniform memory access), the range of memory access latencies
If the system is NUMA (non-uniform memory access), the range of memory access (read and write) bandwidths
If hardware prefetching is supported, the range of strides detected by the prefetcher (including whether backwards strides are detected)
If hardware prefetching is supported, whether it works across page boundaries even when pages are adjacent in virtual memory but not in physical memory
If hierarchical paging is used, the size of the caches for higher-level page table entries (Intel calls these page directories)
If hierarchical paging is used, the associativity of the caches for higher-level page table entries (Intel calls these page directories)
If large pages are supported, the size of each data or unified TLB when holding large pages
If large pages are supported, the associativity of each data or unified TLB when holding large pages
If the system has multiple physical cores, the maximum multi-core throughput of main memory
If the system has multiple logical cores, the maximum multi-thread random throughput of main memory
For each instruction TLB, the size of that TLB
For each instruction TLB, the associativity of that TLB
For each instruction cache, the size of blocks in that cache
For each instruction cache, the associativity of that cache
(or you may propose something else interesting to measure, with approval)

You are encouraged to observe performance counters, such as numbers of last-level cache misses, that may directly reflect cache accesses and other memory system activity you are trying to benchmark. If practical, however, your benchmark should reveal the parameter based on performance alone.

Note that, because of features of modern processors, some measurements may not yield reasonable results even after substantial effort, including effort that would be unnecessary on other platforms. In that case, I do not want to punish you for choosing a more interesting microarchitecture; instead, please document what you tried and what the negative results were.

You may read the value of the parameters manually from your benchmark results (for example, based on the shape of the graph) instead of automating this process (but it would be awesome if you did automate it).
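If you do want to automate reading a parameter from the results, one simple heuristic is to look for the largest relative jump between consecutive measurements. The sketch below is only one possible approach, with a hypothetical function name. It assumes the times are positive and ordered by increasing working-set size, and it finds only the single largest jump, so a system with several cache levels would need it applied per region or extended.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: given per-access times measured at increasing
 * working-set sizes, return the index i that maximizes time[i+1] / time[i].
 * The size measured at index i is then a guess for one cache's capacity.
 */
size_t largest_jump_index(const double *time, size_t n) {
    assert(n >= 2);
    size_t best = 0;
    double best_ratio = 0.0;
    for (size_t i = 0; i + 1 < n; i++) {
        assert(time[i] > 0.0);
        double ratio = time[i + 1] / time[i];
        if (ratio > best_ratio) {
            best_ratio = ratio;
            best = i;
        }
    }
    return best;
}
```

If the measurements were taken at sizes of 8, 16, 32, 64, 128, and 256 KiB and the function returns 2, the estimate would be a 32 KiB cache. Noisy data may need averaging over several runs first.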

2 Deliverables

A single zip or tar archive containing:

A report, described below.
Source code for your benchmarks.

3 Report

A document in PDF, HTML, or text format:

The identity of the system you choose to benchmark.
For each of your microbenchmarks:
- A brief (approx. 1-5 sentences) explanation of what your microbenchmark does and why the microbenchmark should measure the parameters it targets. If you based this technique on prior work you found in the literature or elsewhere, please cite that work.
- (not required for checkpoint) A graph or table with the 'raw' measurements from the microbenchmarks. (For example, if you measure the time it takes to do something with varying sizes, a graph of the measured time against those sizes.) If your report is in text format, you may submit graphs or tables as separate PDF, PNG, SVG, or HTML documents.
- The best guess at the value of each corresponding parameter from the microbenchmark. If you were not able to obtain the value of the parameter from your benchmark, explain why you interpret the results that way, what the likely cause(s) are, and how the benchmark might be improved.
- (not required for checkpoint) A comparison of the parameter value your benchmark suggested with values published or exposed through CPUID-like interfaces. You may use an existing tool (such as x86info) to obtain this information. However you obtain it, identify your sources.

4 Notes

To read the time stamp counter (TSC), which is helpful for timing, from C with GCC or Clang on 64-bit x86, you can use:

#include <stdint.h> /* for uint64_t */
/* ... */
uint64_t read_tsc(void) {
    uint64_t hi, lo;
    /* 
     * Embed the assembly instruction 'rdtsc', which should not be relocated (`volatile').
     * The instruction will modify rax and rdx, which the compiler should map to
     * lo and hi, respectively.
     * 
     * The format for GCC-style inline assembly generally is:
     * __asm__ ( ASSEMBLY CODE : OUTPUTS : INPUTS : OTHER THINGS CHANGED )
     *
     * Note that if you do not (correctly) specify the side effects of an assembly operation, the
     * compiler may assume that other registers and memory are not affected. This can easily
     * lead to cases where your program will produce difficult-to-debug wrong answers when
     * optimizations are enabled.
     */
    __asm__ volatile ( "rdtsc" : "=a"(lo), "=d"(hi));
    return lo | (hi << 32);
}
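A minimal sketch of using such a function to time a region of code follows. `read_tsc` is repeated so the block stands alone; `ticks_for_sum` is a hypothetical helper for illustration. Note that `rdtsc` is not a serializing instruction, so the processor may reorder it relative to nearby instructions. For short regions, consider the fencing techniques described in Intel's documentation on timing code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static inline uint64_t read_tsc(void) {
    uint64_t hi, lo;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return lo | (hi << 32);
}

/*
 * Hypothetical helper: sum n elements of data and return the number of TSC
 * ticks the loop took. The sum is written to *sum_out so the caller can
 * check it.
 */
uint64_t ticks_for_sum(const volatile uint32_t *data, size_t n,
                       uint32_t *sum_out) {
    assert(data != NULL && sum_out != NULL);
    uint32_t sum = 0;
    uint64_t start = read_tsc();
    for (size_t i = 0; i < n; i++)
        sum += data[i];   /* volatile read: each load must be performed */
    uint64_t end = read_tsc();
    *sum_out = sum;
    return end - start;
}
```

Dividing the returned tick count by `n` gives an average cost per access. Repeating the measurement and taking the minimum or median helps filter out interrupts and other noise.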

If you are concerned about variation between cores, the RDTSCP instruction (only on sufficiently recent x86 processors; requires OS support) can return a timestamp and a core ID.
Note that on recent x86 processors, the time stamp counters count at a constant rate regardless of frequency scaling and other variation in core cycle time. There is also an actual cycle counter, which, with kernel support, you can arrange to read via the RDPMC instruction. You may consider libraries like PAPI to help in setting these up.

Because of prefetching, cache misses from access patterns that try to access memory in order may be completely hidden. Consider other access patterns.
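One such pattern is pointer chasing through a random cycle: each load's address depends on the previous load's result, and the order is hard for a prefetcher to predict. The sketch below builds a single cycle over all indices with Sattolo's algorithm. The function names are hypothetical, and `rand()` is only a stand-in random source (its range may be too small for very large arrays).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Fill next[0..n-1] so that repeatedly following i = next[i] visits every
 * index exactly once before returning to the start (one random cycle).
 */
void make_random_cycle(size_t *next, size_t n) {
    assert(n >= 2);
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* 0 <= j < i, never i itself */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for `steps` dependent loads and return the final index. */
size_t chase(const size_t *next, size_t start, size_t steps) {
    size_t i = start;
    while (steps-- > 0)
        i = next[i];
    return i;
}
```

Timing `chase` over many steps and dividing by the step count gives an average latency per dependent load. In this sketch several indices share a cache line, so a real benchmark would likely space the nodes at least one cache line apart.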

5 References that might be useful

Computer Systems: A Programmer’s Perspective, Section 6.6
lmbench: Portable tools for performance analysis
Intel’s documentation about timing code
Intel® 64 and IA-32 Architectures Optimization Reference Manual
Software Optimization Guide for AMD Family 15h Processors and for AMD Family 16h Processors
Example source code for explicitly allocating 'huge pages' on Linux
man 2 madvise in Linux: lets you advise kernel to avoid automatically using huge pages or consider automatically using huge pages
Maurice et al., Reverse Engineering Intel Last-Level Cache Complex Addressing Using Performance Counters, 18th Int’l Symposium on Research in Attacks, Intrusions, and Defenses (RAID’15).
Mowery, Are AES x86 Cache Timing Attacks Still Feasible?
Documentation for using large pages on Windows
man 5 proc in Linux: includes documentation for the special file /proc/PID/pagemap which will provide a process’s virtual to physical page mapping.
man 2 mbind in Linux: on NUMA systems, lets you specify which NUMA region memory is allocated in.
Performance monitoring tools that interact with performance counters, by which processors can provide direct counts on cache misses, etc.:
- Linux’s perf
- PAPI