This page is for a prior offering of CS 3330. It is not up-to-date.
Changelog:
0x80
arguments to _mm_setr_epi8
from example for _mm_shuffle_epi8
._mm_set_epi8
, etc._mm_shuffle_epi6
._mm_shuffle_epi6
. This semester we are experimenting with having students use SIMD (single instruction multiple data) instructions in several assignments. These are sets of instructions that operate on wide registers called vectors. For our assignments, these vectors will be 128 bits wide. Typically, instructions that act on these wide registers will treat it as an array of values. They will then perform operations independently on each value in the array. In hardware, this can be implemented by having multiple ALUs that work in parallel. As a result, although these instructions perform many times more arithmetic than normal
instructions, they can be as fast as the normal instructions.
Generally, we will be accessing these instructions using intrinsic functions
. These functions usually correspond directly to a particular assembly instruction. This will allow us to write C code that accesses this special functionality consistently without losing all the benefits of having a C compiler.
The intrinsic functions we will be using are an interface defined by Intel. Consequently, Intel’s documentation, which can be found here is the comprehensive reference for these functions. Note that this documentation includes functions corresponding to instructions which are not supported on lab machines. To avoid seeing these be sure to check only the boxes labelled SSE
through SSE4.2
on the side.
To use the intrinsic functions, you need to include the appropriate header file. For the intrinsics we will be using this is:
#include <immintrin.h>
To represent 128-bit values that might be stored in one of these registers in C, we will use one of the following types:
__m128
(for four floats)__m128d
(for two doubles)__m128i
(for integers, no matter the size)Since each of these is just a 128-bit value, you can cast between these types if a function you want to use expects the wrong
type of value. For example, you might want to use a function meant to load floating values to load integers. Internally, the functions that expect these types just manipulate 128-bit values in registers or memory.
If you want to load a constant in a 128-bit value, you need to use one of the intrinisc functions. Most easily, you can use one of the functions whose name starts with _mm_setr
. For example:
__m128i values = _mm_setr_epi32(0x1234, 0x2345, 0x3456, 0x4567);
makes values
contain 4 32-bit integers, 0x10
, 0x20
, 0x30
, and 0x40
. We can then extract each of these integers by doing something like:
int first_value = _mm_extract_epi32(values, 0);
// first_value == 0x1234
int second_value = _mm_extract_epi32(values, 1);
// second_value == 0x2345
To load an array of values from memory or store an array of values to memory, we can use the intrinsics starting with _mm_loadu
or _mm_storeu
:
int arrayA[4];
_mm_storeu_si128((__m128i*) arrayA, values);
// arrayA[0] == 0x1234
// arrayA[1] == 0x2345
// ...
int arrayB[4] = {10, 20, 30, 40};
values = _mm_loadu_si128((__m128i*) arrayB);
// 10 == arrayB[0] == _mm_extract_epi32(values, 0)
// 20 == arrayB[1] == _mm_extract_epi32(values, 1)
// ...
To actually perform arithmetic on values, there are functions for each of the supported mathematical operations. For example:
__m128i first_values = _mm_setr_epi32(10, 20, 30, 40);
__m128i second_values = _mm_setr_epi32( 5, 6, 7, 8);
__m128i result_values = _mm_add_epi32(first_values, second_values);
// _mm_extract_epi32(result_values, 0) == 15
// _mm_extract_epi32(result_values, 1) == 26
// _mm_extract_epi32(result_values, 2) == 37
// _mm_extract_epi32(result_values, 3) == 48
The examples treat the 128-bit values as an array of 4 32-bit integers. There are instructions that treat in many different types of values, including other sized integers or floating point numbers. You can usually tell which type is expected by the presence of a something indicating the type of value in the function names. For example, epi32
represents 4 32-bit values
. (The name stands for extended packed integers, 32-bit
.) Some other conventions in names you will see:
si128
– signed 128-bit integerepi8
, epi32
, epi64
— 16 signed 8-bit integers or 4 signed 32-bit integers or 2 64-bit integersepu8
— 16 unsigned 8-bit integers (when there is a difference between what an operation would do with signed and unsigned numbers, such as with conversion to a larger integer or multiplication)epu16
, epu32
— 8 unsigned 16-bit integers or 4 unsigned 32-bit integers (when the operation would be different than signed)ps
— packed single— 4 single-precision floats
pd
— packed double— 2 doubles
ss
— one float (only 32-bits of a 128-bit value are used)sd
— one double (only 64-bits of a 128-bit value are used)The following two C functions are equivalent
int add_no_SSE(int size, int *first_array, int *second_array) {
for (int i = 0; i < size; ++i) {
first_array[i] += second_array[i];
}
}
int add_SSE(int size, int *first_array, int *second_array) {
int i = 0;
for (; i + 4 <= size; ++i) {
// load 128-bit chunks of each array
__m128i first_values = _mm_loadu_si128((__m128i*) &first_array[i]);
__m128i second_values = _mm_loadu_si128((__m128i*) &second_array[i]);
// add each pair of 32-bit integers in the 128-bit chunks
first_values = _mm_add_epi32(first_values, second_values);
// store 128-bit chunk to first array
_mm_storeu_si128((__m128i*) &first_array[i], first_values);
}
// handle left-over
for (; i < size; ++i) {
first_array[i] += second_array[i];
}
}
_mm_add_epi32(a, b)
— treats its __m128i
arguments as 8 32-bit integers. If a
contains the 32-bit integers a0, a1, a2, a3
and b
contains b0, b1, b2, b3
, returns a0 + b0, a1 + b1, a2 + b2, a3 + b3
. (Corresponds to the paddd
instruction.)_mm_add_epi16(a, b)
— Same as _mm_add_epi32
but with 16-bit integers. If a
contains the 16-bit integers a0, a1, ..., a7
and b
contains b1, b2, ..., b7
, returns a0 + b0, a1 + b1, ..., a7 + b7
. (Corresponds to the paddw
instruction.)_mm_add_epi8(a, b)
— Same as _mm_add_epi32
but with 8-bit integers._mm_mullo_epi16(x, y)
: treats x and y as a vector of 16-bit signed integers, multiplies each pair of integers, and truncates the results to 16 bits._mm_mulhi_epi16(x, y)
: treats x and y as a vector of 16-bit signed integers, multiplies each pair of integers to get a 32-bit integer, then returns the top 16 bits of each 32-bit integer result._mm_srli_epi16(x, N)
: treat x
and a vector of 16-bit signed integers, and return the result of logically shifting each right by N
. (There is also a epi32
and epi64
variant for 32 or 64-bit integers.)_mm_slli_epi16(x, N)
: treat x
and a vector of 16-bit signed integers, and return the result of shifting each left by N
. (There is also a epi32
and epi64
variant for 32 or 64-bit integers.)_mm_hadd_epi16(a, b)
— (horizontal add) treats its
__m128i
arguments as vectors of 16-bit integers. If a
contains a0, a1, a2, a3, ..., a7
and b
contains b0, b1, b2, b3, ..., b7
, returns a0 + a1, a2 + a3, a4 + a5, a6 + a7, b0 + b1, b2 + b3, b4 + b5, b6 + b7
. Note that this is often substantially slower than _mm_add_epi16
. (Corresponds to the phaddw
instruction.)_mm_loadu_si128
, _mm_storeu_si128
— load 128 bits to or from memory. (Corresponds to the movdqu
instruciton.) Note that you can use _mm_storeu_si128
to store into a temporary array as in:
unsigned short values_as_array[8];
__m128i values_as_vector;
_mm_storeu_si128((__m128i*) &values_as_array[0], values_as_vector);
_mm_storel_epi64
— store the first 64-bits of a vector in memory. Example usage:
unsigned short first_four_values_as_array[4];
__m128i values_as_vector;
_mm_store_epi64((__m128i*) &values_as_array[0], values_as_vector);
(Although this takes a poiner to __m128i
, a 128-bit value, it only writes 64 bits.)
_mm_store_ss
— store the first 32-bits of a vector in memory. The function prototype assumes that you are dealing with floats, so it expects a float pointer and a __m128
representing the vector. But the instruction generated doesn’t care what the bits of the vector represent. Example usage:
unsigned short first_two_values_as_array[2];
__m128i values_as_vector;
_mm_store_ss((float*) &values_as_array[0], (__m128) values_as_vector);
_mm_setr_epi32
— returns a __m128i
value containing the specified 32-bit integers. The first integer argument will be in the part of the __m128i
that has the lowest address when written to memory. For example:
__m128i value1 = _mm_setr_epi16(0, 1, 2, 3);
produces the same result in value1
as in value2
in
int array[8] = {0, 1, 2, 3};
__m128i value2 = _mm_loadu_si128((__m128i*) &array[0]);
_mm_setr_epi16
— same as _mm_setr_epi32
but with 16-bit integers. For example:
__m128i value1 = _mm_setr_epi16(0, 1, 2, 3, 4, 5, 6, 7);
produces the same result in value1
as in value2
in
short array[8] = {0, 1, 2, 3, 4, 5, 6, 7};
__m128i value2 = _mm_loadu_si128((__m128i*) &array[0]);
_mm_setr_epi8
— same as _mm_setr_epi32
but with 8-bit integers.
_mm_set1_epi32
, _mm_set1_epi16
, _mm_set1_epi8
— return a __m128i
value representing an array of values of the appropriate size, where each element of the array has the same value. For example:
__m128i value = _mm_set1_epi16(42);
has the same effect as:
__m128i value = _mm_setr_epi16(42, 42, 42, 42, 42, 42, 42, 42);
_mm_set_epi8
, etc. — same as _mm_setr_epi8
, etc. but takes its arguments in reverse order
_mm_extract_epi32(a, index)
extracts the index
’th 32-bit integer from a
. The integer with index 0 is the one that will be stored at the lowest memory address if a
is copied to memory. (Corresponds to the pextrd
instruction.)_mm_extract_epi16(a, index)
is same as _mm_extract_epi32
but with 16-bit integers_mm_cvtsi128_si32(a)
has the same effect as _mm_extract_epi32(a, 0)
, but might be faster._mm_cvtepu8_epi16(eight_bit_numbers)
: converts the first 8 of 16 8-bit unsigned integers into a vector of 8 16-bit signed integers. For example:
__m128i value1 = _mm_setr_epi8(10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150);
__m128i value2 = _mm_cvtepu8_epi16(value1);
results in value2 containing the same value as if we did:
__m128i value2 = _mm_setr_epi16(10, 20, 30, 40, 50, 60, 70, 80);
_mm_shuffle_epi8(a, mask)
rearrange the bytes of a
according to mask
and return the result. mask
is a vector of 8-bit integers (type __m128i
) that indicates how to rearrange each byte:
For example:
__m128i value1 = _mm_setr_epi8(10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160);
__m128i mask = _mm_setr_epi8(0x80, 0x80, 0x80, 5, 4, 3, 0x80, 7, 6, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80);
__m128i value2 = _mm_shuffle_epi8(value1, mask);
should produce the same result as:
__m128i value2 = _mm_setr_epi8(0, 0, 0, 60, 50, 40, 0, 80, 70, 0, 0, 0, 0, 0, 0, 0, 0);
/* e.g. since 3rd element of mask is 5, 3rd element of output is 60, element 5 of the input */
The instruction
paddd %xmm0, %xmm1
takes in two 128-bit values, one in the register %xmm0
, and another in the register %xmm1
. Each of these registers are treated as an array of two 64-bit values. Each pair of 64-bit values is added together, and the results are stored in %xmm1
.
For example, if %xmm0
contains the 128-bit value (written in hexadecimal):
0x0000 0000 0000 0001 FFFF FFFF FFFF FFFF
and %xmm1
contains the 128-bit value (written in hexadecimal):
0xFFFF FFFF FFFF FFFE 0000 0000 0000 0003
Then %xmm0
would be treated as containing the numbers 1
and -1
(or 0xFFFFFFFFFFFFFFFF
), and %xmm1
as containing the numbers -2
and 3
. paddd
would add 1
and -2to produce -1 and
-1and
3to produce 2, so the final value of
%xmm1` would be:
0xFFFF FFFF FFFF FFFF 0000 0000 0000 0002
If we interpret this value as an array of two 64-bit integers, then that would be -1
and 2
.