Task
- In general, this lab deals with vector instructions and their corresponding “intrinsic” functions. There are several sources for information on these:
- Below, there is a brief introduction to SIMD and the intrinsic functions, which should mostly duplicate the lecture material.
- We provide a reference for vector intrinsic functions you may find useful here.
- The official Intel documentation provides a comprehensive list of the intrinsic functions. Our lab machines only support the “SSE” functions, so when using this reference, only check the boxes labelled “SSE” through “SSE4.2”.
- Download simdlab.tar and extract it. [Last updated 2018-04-12]
- Read the brief introduction to SIMD below if you need a refresher on the lecture material on vector instructions.
- Read the explanation of an example SIMD function below. This includes a description of several things you will need for the next step:
- Edit sum_benchmarks.c to add a function sum_SSE that uses vector intrinsics in a very similar way to the example SIMD function:
  - Start by making a copy of the sum_with_eight_accumulators function supplied with the tarball.
  - Change it to store the eight accumulators in one of the 128-bit registers rather than eight separate registers, and use vector instructions to manipulate this register. You will primarily use the intrinsic _mm_add_epi16 (add packed 16-bit integer values).
  - See the detailed explanation below.
  - Add sum_SSE to the list of benchmarks in sum_benchmarks.c. Compile it by running make and then run ./sum to time it.
- Edit min_benchmarks.c to add a new function min_SSE that does the same thing as the supplied min_C function. You can use a strategy very similar to the one used for sum_SSE, using the intrinsic function _mm_min_epi16. See the detailed explanation and descriptions of useful intrinsic functions below. Add min_SSE to the list of benchmarks in min_benchmarks.c, compile it with make, then run ./min to time it.
- (Optional) If you have time, edit dot_product_benchmarks.c to create a vectorized version of dot_product_C called dot_product_SSE. See the detailed explanation below.
- (Optional) If you have time, modify your dot_product_benchmarks.c to try to improve the performance of your dot product function using the advice below.
- Submit whatever you have completed to archimedes.
Compatibility note
OS X requires that function names have an additional leading underscore in assembly. So, the supplied assembly files (provided for comparison with good compiler-generated versions) will not work on OS X. The easiest thing to do is use Linux for the lab (either via SSH or via a VM). Alternatively, you can:
- edit the supplied assembly files sum_clang5_O.s and dot_product_gcc7_O3.s to change the function names in them; or
- remove these functions from the list of benchmarks in sum_benchmarks.c and dot_product_benchmarks.c.
SIMD introduction
In this lab, you will use SIMD (Single Instruction Multiple Data) instructions, also known as vector instructions, available on our lab machines, to produce more efficient versions of several simple functions.
Vector instructions act on vectors of values. For example,
paddw %xmm0, %xmm1
uses the 128-bit registers %xmm0 and %xmm1. But instead of adding two 128-bit integers together,
it treats each register as a “vector” of eight 16-bit integers and adds each pair of 16-bit integers.
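For intuition, here is a rough plain-C model of what that one instruction computes (a hypothetical helper for illustration only; this is not how you will actually write vector code):

void paddw_model(unsigned short dst[8], const unsigned short src[8]) {
    /* each of the eight 16-bit lanes is added independently;
       overflow in one lane never carries into the next lane */
    for (int i = 0; i < 8; ++i)
        dst[i] = dst[i] + src[i];
}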
The instructions we will be using in this lab are part of versions of Intel’s SSE (Streaming SIMD extensions)
to the x86 instruction set. (The book, on page 546, also talks about a later Intel extension, AVX (Advanced Vector eXtensions),
but these are not supported by our lab machines.)
Rather than writing assembly directly, we will have you use Intel’s intrinsic functions
to access these instructions. For example, to access the paddw instruction from C, you will call the special
function _mm_add_epi16. Each of these functions will compile into particular assembly instructions, allowing us to
specify that the special vector instructions should be used without also needing to write
all of our code in assembly.
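As a minimal, self-contained illustration (a standalone sketch separate from the lab’s supplied files, assuming a compiler with SSE2 support and the SSE2 header <emmintrin.h>), calling _mm_add_epi16 from C looks like this:

#include <emmintrin.h>   /* SSE2 intrinsics, including _mm_add_epi16 */
#include <stdio.h>

int main(void) {
    unsigned short a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    unsigned short b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    unsigned short result[8];
    __m128i va = _mm_loadu_si128((__m128i*) &a[0]);   /* load eight 16-bit values */
    __m128i vb = _mm_loadu_si128((__m128i*) &b[0]);
    __m128i vsum = _mm_add_epi16(va, vb);             /* compiles to a paddw instruction */
    _mm_storeu_si128((__m128i*) &result[0], vsum);
    for (int i = 0; i < 8; ++i)
        printf("%d ", result[i]);                     /* prints: 11 22 33 44 55 66 77 88 */
    printf("\n");
    return 0;
}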
You will create vectorized versions of three functions in the lab. (Since this is our first time offering this lab, we are not sure how long this will take to complete. If you don’t finish everything during the lab time, submit whatever you completed.)
We believe we have included all the information you need to complete this lab below, but we also have a more reference-like explanation of the Intel intrinsics here.
A note on compiler vectorizations
Compilers can sometimes generate vector instructions automatically. The Makefile we have supplied in this lab uses optimization settings under which the compiler on our lab machines will not do this. We believe this will also be the case with other compilers, but we have not tested all of them.
The purpose of this lab is to familiarize you with how to use vector operations, so you can handle more complicated problems where the compiler will not do a good enough job, and so you can better understand what compilers are doing.
General SIMD reference
We provide a partial reference to the Intel SSE intrinsic functions here, which you may wish to refer to.
In addition, the official Intel documentation provides a comprehensive reference of all available functions. Note that our lab machines only support the “SSE” functions. So, when using the Intel page, only check the boxes labelled “SSE” through “SSE4.2”.
Example SIMD function
In this lab, you will be creating optimized versions of several functions using vector instructions. To help you, we have already created an example:
- add_benchmarks.c – normal and vectorized versions of an “add two arrays” function. It assumes the array sizes are multiples of 8.
Compile this by running make, then run it by running ./add. You will see benchmark results for the two
versions of this “add two arrays” function.
One is this function:
void add_C(long size, unsigned short * a, const unsigned short *b) {
for (long i = 0; i < size; ++i) {
a[i] += b[i];
}
}
The other is a version that accesses vector instructions through special “intrinsic functions”:
/* vectorized version; assumes the element type is short or unsigned short */
void add_SSE(long size, unsigned short * a, const unsigned short *b) {
for (long i = 0; i < size; i += 8) {
/* load 128 bits from a */
/* a_part = {a[i], a[i+1], a[i+2], ..., a[i+7]} */
__m128i a_part = _mm_loadu_si128((__m128i*) &a[i]);
/* load 128 bits from b */
/* b_part = {b[i], b[i+1], b[i+2], ..., b[i+7]} */
__m128i b_part = _mm_loadu_si128((__m128i*) &b[i]);
/* a_part = {a[i] + b[i], a[i+1] + b[i+1], ...,
a[i+7] + b[i+7]}
*/
a_part = _mm_add_epi16(a_part, b_part);
_mm_storeu_si128((__m128i*) &a[i], a_part);
}
}
New Types
An __m128i represents a 128-bit value that can be stored in one of the special 128-bit %xmm registers
on our lab machines. The i indicates that the 128-bit value contains an array of integers. In this case,
they are 16-bit integers, but we can also work with other sizes of integers that fit in 128 bits.
Whenever we want to get or use a __m128i value, we will use one of the special functions whose
name begins with _mm. You should not try to extract values more directly. (This may compile, but will probably not
do what you expect and may differ between compilers.)
We also have some functions that take a __m128i*. This is a pointer to a 128-bit value which can
be loaded into a 128-bit register. When we cast &a[i] to this type, we are indicating that we mean “128 bits
starting with a[i]”. Since each element of a is 16 bits, this means a[i] up to and including a[i+7].
New “intrinsic” functions
To manipulate the 128-bit values we use several intrinsic functions:
- __m128i _mm_loadu_si128(__m128i* ptr): loads 128 bits from memory at ptr. In this case, those 128 bits represent a vector of eight 16-bit values. For example, a_part represents the vector {a[i], a[i+1], a[i+2], a[i+3], a[i+4], a[i+5], a[i+6], a[i+7]}.
- __m128i _mm_add_epi16(__m128i x, __m128i y): treats the 128-bit values as vectors of eight 16-bit values and adds each pair. If x is the vector {x[0], x[1], x[2], x[3], x[4], x[5], x[6], x[7]} and y is the vector {y[0], y[1], y[2], y[3], y[4], y[5], y[6], y[7]}, then the result is {x[0] + y[0], x[1] + y[1], x[2] + y[2], x[3] + y[3], x[4] + y[4], x[5] + y[5], x[6] + y[6], x[7] + y[7]}.
- void _mm_storeu_si128(__m128i* ptr, __m128i value): stores the 128-bit value into memory at ptr.
Each of these functions will always be inlined, so we do not need to worry about function call overhead.
Most of the special _mm functions will compile into exactly one instruction (as you can see below).
The epi16 part of some function names probably stands for “extended packed 16-bit”, indicating that the function
manipulates a vector of 16-bit values.
Intrinsics and assembly
Typical assembly code generated for add_SSE above looks like:
add_SSE:
// size <= 0 --> return
testq %rdi, %rdi
jle end_loop
// i = 0
movl $0, %eax
start_loop:
// __m128i b_part = _mm_loadu_si128((__m128i*) &b[i]);
movdqu (%rdx,%rax,2), %xmm0
// __m128i a_part = _mm_loadu_si128((__m128i*) &a[i]);
movdqu (%rsi,%rax,2), %xmm1
// a_part = _mm_add_epi16(a_part, b_part);
paddw %xmm1, %xmm0
// _mm_storeu_si128((__m128i*) &a[i], a_part)
movups %xmm0, (%rsi,%rax,2)
// i += 8
addq $8, %rax
// i < size --> continue loop
cmpq %rax, %rdi
jg start_loop
end_loop:
ret
(You can see the actual code in add_benchmarks.s.)
(Various details will vary between compilers, and with some optimization settings, compilers might try to perform other optimizations, like loop unrolling.)
Each of the _mm_ functions corresponds directly to an assembly instruction:
- _mm_loadu_si128 turns into movdqu
- _mm_add_epi16 turns into paddw
- _mm_storeu_si128 turns into movups
Task 1: Sum with Intel intrinsics
The first coding task is to create a version of sum:
unsigned short sum_C(long size, unsigned short * a) {
unsigned short sum = 0;
for (int i = 0; i < size; ++i) {
sum += a[i];
}
return sum;
}
that uses vector instructions through the intrinsic functions.
Start by making a copy of the provided sum_with_eight_accumulators that uses 8 accumulators:
unsigned short sum_with_eight_accumulators(long size, unsigned short *a) {
unsigned short partial_sum0 = 0, partial_sum1 = 0, partial_sum2 = 0, partial_sum3 = 0,
partial_sum4 = 0, partial_sum5 = 0, partial_sum6 = 0, partial_sum7 = 0;
for (int i = 0; i < size; i += 8) {
partial_sum0 = partial_sum0 + a[i+0];
partial_sum1 = partial_sum1 + a[i+1];
partial_sum2 = partial_sum2 + a[i+2];
partial_sum3 = partial_sum3 + a[i+3];
partial_sum4 = partial_sum4 + a[i+4];
partial_sum5 = partial_sum5 + a[i+5];
partial_sum6 = partial_sum6 + a[i+6];
partial_sum7 = partial_sum7 + a[i+7];
}
return partial_sum0 + partial_sum1 + partial_sum2 + partial_sum3 +
partial_sum4 + partial_sum5 + partial_sum6 + partial_sum7;
}
Rename this copy sum_SSE.
Since the loop performs eight independent additions of 16-bit values, it can be changed
to use a single call to _mm_add_epi16:
- Instead of storing these eight 16-bit accumulators in separate variables, declare a single __m128i variable (perhaps called partial_sums), which will contain all of their values. You can initialize it to zero with something like:
  __m128i partial_sums = _mm_setzero_si128();
- Instead of loading a[i+0] through a[i+7] separately, call _mm_loadu_si128 to load them all into a single __m128i variable. This may be identical to how a_part is set in add_SSE above.
- Instead of performing 8 additions, use one call to _mm_add_epi16 with partial_sums and a_part (or whatever you called these variables).
- After the loop, store the 8 partial sums in a temporary array on the stack using _mm_storeu_si128:
  unsigned short extracted_partial_sums[8];
  _mm_storeu_si128((__m128i*) &extracted_partial_sums, partial_sums);
  Then, add up these eight partial sums. (A rough sketch combining these steps appears after this list.)
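Here is that sketch, a minimal outline only (assuming, like the supplied versions, that size is a multiple of 8); treat it as a starting point rather than a finished answer:

unsigned short sum_SSE(long size, unsigned short *a) {
    /* eight 16-bit partial sums packed into one 128-bit value, all starting at 0 */
    __m128i partial_sums = _mm_setzero_si128();
    for (long i = 0; i < size; i += 8) {
        /* a_part = {a[i], a[i+1], ..., a[i+7]} */
        __m128i a_part = _mm_loadu_si128((__m128i*) &a[i]);
        /* add each of the eight 16-bit elements onto its accumulator */
        partial_sums = _mm_add_epi16(partial_sums, a_part);
    }
    /* copy the eight accumulators back to ordinary variables and combine them */
    unsigned short extracted_partial_sums[8];
    _mm_storeu_si128((__m128i*) &extracted_partial_sums[0], partial_sums);
    unsigned short sum = 0;
    for (int i = 0; i < 8; ++i)
        sum += extracted_partial_sums[i];
    return sum;
}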
When you’ve completed this sum_SSE function, add it to the list of functions in
sum_benchmarks.c, then run make to compile it. Then compare its performance to
the other versions using ./sum.
Also examine the assembly code the compiler generated for your sum_benchmarks.c in sum_benchmarks.s.
(It is also possible to perform the last 8 additions in parallel, without copying to the stack first, but for simplicity and because it has a small effect on performance, we will not require that here.)
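If you are curious anyway, one way to do that final reduction with vector instructions is the following sketch. It assumes partial_sums holds the eight 16-bit accumulators from the loop, and it uses the SSE2 intrinsics _mm_srli_si128 (shift the whole 128-bit value right by a byte count) and _mm_extract_epi16 (read out one 16-bit lane):

/* assumes partial_sums holds the eight 16-bit accumulators from the loop */
__m128i t = partial_sums;
t = _mm_add_epi16(t, _mm_srli_si128(t, 8));   /* add lanes 4-7 onto lanes 0-3 */
t = _mm_add_epi16(t, _mm_srli_si128(t, 4));   /* add lanes 2-3 onto lanes 0-1 */
t = _mm_add_epi16(t, _mm_srli_si128(t, 2));   /* add lane 1 onto lane 0 */
unsigned short total = (unsigned short) _mm_extract_epi16(t, 0);   /* lane 0 = sum of all eight */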
Task 2: Vectorized min
The next task is, using the same idea as you used to vectorize the sum, to create a vectorized version of this min function:
short min_C(long size, short * a) {
short result = SHRT_MAX;
for (int i = 0; i < size; ++i) {
if (a[i] < result)
result = a[i];
}
return result;
}
which you can find in min_benchmarks.c. Create a new version of this that acts on __m128i variables containing
eight elements of the array at a time. Some intrinsic functions that will be helpful (you can also
refer to our reference page or the Intel documentation):
- __m128i _mm_setr_epi16(short a1, short a2, short a3, short a4, short a5, short a6, short a7, short a8) — returns a __m128i representing a vector of eight signed 16-bit values. a1 will be the value that would be stored at the lowest memory address.
- _mm_set1_epi16(short a) — same as _mm_setr_epi16(a, a, a, a, a, a, a, a).
- _mm_min_epi16(a, b) — assumes that a and b contain vectors of eight 16-bit signed integers and returns the minimum of each pair. For example:
  __m128i first = _mm_setr_epi16(-0x0100, 0x1000, 0x2000, 0x3000, 0x4000, 0x5000, -0x6000, 0x7000);
  __m128i second = _mm_setr_epi16(0x1000, 0x2000, 0x0100, 0x2000, 0x5000, 0x7FFF, -0x1000, -0x8000);
  __m128i result = _mm_min_epi16(first, second);
  makes result contain {-0x0100, 0x1000, 0x0100, 0x2000, 0x4000, 0x5000, -0x6000, -0x8000}. (-0x0100 is the minimum of the first elements -0x0100 and 0x1000; 0x1000 is the minimum of the second elements 0x1000 and 0x2000, and so on.)
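Putting these together, a minimal sketch of one possible min_SSE (assuming size is a multiple of 8, and that SHRT_MAX from <limits.h> is available, as it already is for min_C) might look like this:

short min_SSE(long size, short *a) {
    /* eight running minimums, all starting at SHRT_MAX */
    __m128i partial_mins = _mm_set1_epi16(SHRT_MAX);
    for (long i = 0; i < size; i += 8) {
        __m128i a_part = _mm_loadu_si128((__m128i*) &a[i]);
        /* keep the smaller value in each of the eight 16-bit lanes */
        partial_mins = _mm_min_epi16(partial_mins, a_part);
    }
    /* copy the eight lane minimums to the stack and reduce them */
    short extracted_mins[8];
    _mm_storeu_si128((__m128i*) &extracted_mins[0], partial_mins);
    short result = SHRT_MAX;
    for (int i = 0; i < 8; ++i)
        if (extracted_mins[i] < result)
            result = extracted_mins[i];
    return result;
}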
After adding your vectorized function to min_benchmarks.c and adding it to the list of functions, test it by running make and then ./min.
(optional) Task 3: Vectorize dot-product
Now let’s vectorize the following function:
unsigned int dot_product_C(long size, unsigned short *a, unsigned short *b) {
    unsigned int sum = 0;
    for (int i = 0; i < size; ++i)
        sum += a[i] * b[i];
    return sum;
}
Note that this function computes its sums with unsigned ints instead of unsigned shorts,
so you’ll need to add 32-bit integers instead of 16-bit integers.
So, you will have 128-bit values which contain four 32-bit integers instead of eight 16-bit integers.
To obtain these in the first place, you’ll need to convert the 16-bit integers you read from the array into 32-bit integers; fortunately, there is a vector instruction (and intrinsic function) to do this quickly.
To manipulate these as 32-bit integers, you will use functions containing epi32 in their names instead of epi16, which correspond to different vector instructions.
Some intrinsic functions which may be helpful:
- __m128i _mm_add_epi32(__m128i x, __m128i y) is like _mm_add_epi16 but expects vectors of four 32-bit integers instead of eight 16-bit integers.
- __m128i _mm_setzero_si128() returns an all-zeroes 128-bit value. When interpreted as a vector of integers of any size, it will be all 0 integers.
- __m128i _mm_cvtepu16_epi32(__m128i x): if x contains a vector of eight 16-bit unsigned integers, takes the first four and converts them into a vector of four 32-bit integers. The rest are discarded. For example:
  unsigned short foo[8] = {1, 2, 3, 4, 5, 6, 7, 8};
  unsigned int result[4];
  __m128i foo_as_vector = _mm_loadu_si128((__m128i*) &foo[0]);
  __m128i foo_converted = _mm_cvtepu16_epi32(foo_as_vector);
  _mm_storeu_si128((__m128i*) &result[0], foo_converted);
  makes result become {1, 2, 3, 4}.
- _mm_mullo_epi32(x, y) — like _mm_add_epi32, but multiplies each pair of 32-bit values to produce a 64-bit product and truncates each product to 32 bits.
Like you did with sum, you can add up partial sums at the end by storing them in a temporary array on the stack.
Note that since you are adding vectors of four 32-bit values, your loop will probably act on four elements at a time (even though
you can load more at once). For now, don’t worry about loading 128 bits into an __m128i and only using 64 bits of them.
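To make the pieces concrete, here is a minimal sketch of one possible vectorized dot product built from the functions above (assuming size is a multiple of 4, and ignoring the extra 64 bits each load brings in, as suggested above):

unsigned int dot_product_SSE(long size, unsigned short *a, unsigned short *b) {
    /* four 32-bit partial sums, all starting at 0 */
    __m128i partial_sums = _mm_setzero_si128();
    for (long i = 0; i < size; i += 4) {
        /* load eight 16-bit values, then widen the first four of them to 32 bits */
        __m128i a_part = _mm_cvtepu16_epi32(_mm_loadu_si128((__m128i*) &a[i]));
        __m128i b_part = _mm_cvtepu16_epi32(_mm_loadu_si128((__m128i*) &b[i]));
        /* multiply corresponding 32-bit elements and accumulate */
        __m128i products = _mm_mullo_epi32(a_part, b_part);
        partial_sums = _mm_add_epi32(partial_sums, products);
    }
    /* extract the four partial sums and combine them */
    unsigned int extracted_partial_sums[4];
    _mm_storeu_si128((__m128i*) &extracted_partial_sums[0], partial_sums);
    return extracted_partial_sums[0] + extracted_partial_sums[1] +
           extracted_partial_sums[2] + extracted_partial_sums[3];
}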
After adding your vectorized function to dot_product_benchmarks.c and adding it to the list of functions, test it for correctness by running make and then ./dot_product.
(It’s possible that your first vectorized version will be slower than the original because you are not using multiple accumulators. Although the vector instructions can perform more computations per unit time, they tend to have high latency.)
(optional) Task 4: Optimize the vectorized dot-product
Make a copy of your vectorized dot-product function and see how it is affected by applying various possible optimizations. Things you might try include:
- loop optimizations from the last lab, such as multiple accumulators.
- using different vector instructions. For example, based on some unofficial instruction timing tables, multiplying two 16-bit integers to get a 32-bit integer using _mm_mullo_epi16 and _mm_mulhi_epi16 together (see reference links above) might be faster on some processors than _mm_mullo_epi32. (A sketch of this idea appears after this list.)
- trying to do any final summation using vector instructions.
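As mentioned in the list above, here is a sketch of the 16-bit multiply idea, showing only a loop body. It assumes a_part and b_part each hold eight unsigned 16-bit values and partial_sums holds four 32-bit accumulators; because the inputs are unsigned, it uses the unsigned high-half multiply _mm_mulhi_epu16, and it rebuilds full 32-bit products by interleaving the low and high halves with the SSE2 intrinsics _mm_unpacklo_epi16 and _mm_unpackhi_epi16:

/* low and high 16 bits of each of the eight products a[j]*b[j] */
__m128i lo = _mm_mullo_epi16(a_part, b_part);
__m128i hi = _mm_mulhi_epu16(a_part, b_part);
/* interleave low/high halves to rebuild the full 32-bit products */
__m128i products_0_3 = _mm_unpacklo_epi16(lo, hi);   /* products for elements 0..3 */
__m128i products_4_7 = _mm_unpackhi_epi16(lo, hi);   /* products for elements 4..7 */
partial_sums = _mm_add_epi32(partial_sums, products_0_3);
partial_sums = _mm_add_epi32(partial_sums, products_4_7);

With this approach, each loop iteration handles eight elements instead of four.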
See if you can match or beat the performance of the supplied version of dot_product_C compiled with GCC 7.2 with
the optimization flags -O3 -msse4.2. Or, at least, try to make your function faster than the original plain C version, if it was not already.
If you are using your laptop, check whether the performance differences on your laptop are consistent with those on the lab machines.
Submission
Run make simdlab-submit.tar to create an archive of your C files (called simdlab-submit.tar) and upload it to archimedes.
Please submit whatever you completed within lab time, even if it is less than the whole lab.