CS 3330: Looplab

This page is for a prior offering of CS 3330. It is not up-to-date.

1 Lab setup

In this lab, you will explore the effect of loop optimizations on a very simple loop:

unsigned short sum_C(long size, unsigned short * a) {
    unsigned short sum = 0;
    for (int i = 0; i < size; ++i) {
        sum += a[i];
    }
    return sum;
}

First, download the lab tarball here. Extract the tarball, it includes several notable files:

sum_benchmarks.c — file contaiing a list of versions of the above C code to time
sum_naive.s — a naive assembly implementation of the above C code.
sum_gcc5_O2.s, sum_gcc5_O3.s, sum_gcc7_O3.s, sum_clang5_O.s — versions of the above C code compiled with various compilers and various optimization flags
sum_main.c — main function that times several version of the above C code.
timing.c, timing.h — internal code used for timing
sum.h — various definitions used by our timing, sum code
Makefile — instructions to allow make to build the testing binary

Compatibility note: OS X requires that function names have an additional leading underscore in assembly. So, the supplied assembly files will not work on OS X. The easiest thing to do is use Linux for the lab (either via SSH or via a VM). Alternately, you can modify the assembly files to add an _ before the function names (e.g. changing sum_naive: to _sum_naive: and .global sum_naive to .global _sum_naive).

Run make to build the timing binary sum, then run it with

./sum

This will output times for 6 versions of the above C loop with various array sizes. The times will be shown in cycles/element.

sum_benchmarks.c controls the list of functions that are timed.

The 6 versions we have supplied are:

sum_C — This is the above code compiled on your machine. If you didn’t change the Makefile, on the lab machines, this will use GCC version 4.9 with the options -O2 -msse4.2. On your machine it will use whatever compiler gcc is with -O2 -msse4.2
sum_naive — This is the assembly code in sum_naive.s. This is a simple assembly implementation that does not attempt to do any loop optimizations.
sum_gcc5_O2 — This is the above C code compiled with GCC version 5.4.1 with the option -O2 (with the function renamed). The assembly code is in sum_gcc5_O2.s
sum_gcc5_O3 — This is the above C code compiled with GCC version 5.4.1 with the option -O3 (with the function renamed). The assembly code is in sum_gcc5_O3.s
sum_clang5_O — This is the above C code compiled with Clang version 5.0.0 with the options -O -msse4.2 (with the function renamed). The assembly code is in sum_clang5_O.s
sum_gcc7_O3 — This is the above C code compiled with GCC version 7.2 with the options -O3 -msse4.2 (with the function renamed). The assembly code is in sum_gcc7_O3.s

2 Loop optimization

Your task in this lab is to implement several loop optimizations and see how they perform. For most of these, you will be creating assembly files of the optimizations. This will make it easy to isolate the effect of these optimizations from other optimizations the compiler might be performing.

2.1 Loop unrolling in assembly

Create a copy of sum_naive.s called sum_unrolled2.s. Rename the function from sum_naive to sum_unrolled2 and modify it to unroll the loop 2 times. To make things easier, you may assume that the size is a multiple of 16.

Then, modify sum_benchmarks.c to add sum_unrolled2 to sum_benchmarks.c. (You will need to add it to functions in addition to writing its prototype.) Run make to recompile the sum program and see how your function performs.

2.1.1 More loop unrolling

Then, repeat this process to create an sum_unrolled4.s which unrolls the loop 4 times, and a sum_unrolled8.s which unrolls the loop 8 times.

Add these new functions to sum_benchmarks.c and see how they perform.

2.2 Multiple accumulators in assembly

Copy your sum_unrolled8.s into an sum_accums.s. Rename the function from sum_unrollled8 to sum_multiple_accum and change it to use multiple accumulators (at least 2) instead of just using %rax. For example, you might add every even element to %rax and every odd element to %rbx; then, after the loop, you would add the two partial sums in %rbx and %rax.

Add this new function to sum_benchmarks.c and see how they perform.

2.3 Unroll and multiple accumulators in C

Open sum_benchmarks.c and create a copy of the function sum_C called sum_accums_C. In this copy, write a C version of your multiple accumulators code. Compare its performance to the version you wrote in assembly. Look at the assembly code in sum_benchmarks.s to see if the compiler did any additional optimizations.

2.4 (optional) Changing optimization options

Also try changing Makefile where it says CFLAGS = to pass different optimization options to your compiler, or changing what compiler is used by changing CC =. You will need to run make clean and then make after doing this to rebuild the sum program with the new option.

See if your compiler will perform loop unrolling or use multiple accumulators.

On the lab machines, with GCC interesting options include:

-O0 (no optimizations)`
-O1 (optimization level 1)
-O3 (optimization level 3)
-Os (optimize for size)
-march=native (build for this particular machine, even if that makes it incompatible with other machines; combine this with other optimization options)

On the lab machines, at least 4 C compilers are installed:

gcc-4.8
gcc-4.9
gcc-5
clang-3.9

3 Submission

Run make looplab-submit.tar to create an archive of your .s and .c files. Upload the result to archimedes.

4 Appendix: Timing

4.1 Cycle counters

The timing code we have supplied uses the rdtsc (ReaD Time Stamp Counter) instruction to measure the performance of the function. Historically, this accessed a counter of the number of processor clock cycles. On current generation processors, where different processor cores have different clock rates and clock rates vary to save power, that is no longer how rdtsc works. On modern systems, rdtsc accesses the number of cycles of a counter that counts at a constant rate regardless of the actual clock speeds of each core. This means that the cycle counter reliably measures wall clock time rather than actually measuring the number of cycles taken.

Since clock rates vary on modern processors, measurements of wallclock time do not have an obvious correlation to number of clock cycles. A particular problem are processor features like Intel Turbo Boost or AMD Turbo Core. (These might generally be called dynamic overclocking.) In these cases, processor cores briefly operate at faster than the normal maximum clock rate. This means that microbenchmarks like ours my make the processor appear faster than it would be under normal operation — e.g., if we needed to compute sums repeatedly over a period of time. The cycle counter generally counts clock cycles at the normal sustained clock rate.

4.2 Taking minimums

The function tries to give the approximate minimum timing, ignoring temporary effects like moving arrays into cache or other things running on the system. To do this, it runs the function until:

It has run for at least 250 million cycles; and
Either:
- The 10 shortest times are within .5% of each other; or
- 200 attempts have been made.

It then returns the 5th shortest time (ordinarily within .5% of the shortest time).