Changelog:
- 2 Nov 2023: state that one of the non-required map strategies should be measured; adjust optional labels to something else
- 6 Nov 2023: suggest different file on portal to test with since what’s installed on portal changed since last semester
- 6 Nov 2023: add cast in starting code to quash signedness changing warning
- 6 Nov 2023: expand explanation of #pragma omp parallel for to separate it into #pragma omp parallel and #pragma omp for
- 6 Nov 2023: reorganize explanation of parallel for schedules to more clearly separate options
- 8 Nov 2023: on task queue with larger task, correct indices to work on to reflect what += returns
1 Task
Work, ideally with a partner, to use OpenMP to parallelize the Starter code. You should be able to understand all of the code provided, but will only need to work on the geomean function for this lab.
To run the code, compile with the -lm flag and give it file names as command-line arguments. It will report the number of bytes in those files and the geometric mean value of those bytes. You can provide multiple files on the command line if you wish; all will be combined before analysis. Parallelism gives the biggest benefit when given large inputs, so you might want to try some larger files: for example, on portal /usr/bin/emacs-gtk has more than 6 million characters.
$ gcc -fopenmp -O2 openmpstarter.c -lm
$ ./a.out /usr/bin/emacs-gtk
40731819 ns to process 6306696 characters: 4.75838
To parallelize, separate the code into Map and Reduce steps and try several OpenMP parallelization strategies described below; keep track of which was best.
For full credit on submission, you must measure all the strategies not labeled "not required" or "extra" below, plus one of the extra strategies for map.
For full credit on an in-person check-off, we may accept somewhat less being measured if you run out of time in the lab.
When timing:
- make sure the answer (4.75838 in the output example above) does not change substantially because of your optimizations; and
- take multiple measurements so other activity on the machines doesn't dramatically affect your results
Then either:
- submit your code (preferably as one file with multiple copies of the geomean function, including all versions you timed) and a text file performance.txt explaining which was fastest and your guess as to why; or
- check off with a TA, being prepared to explain which approach was fastest and your guess as to why
2 Introduction
OpenMP is designed to make common kinds of speed-oriented parallelism simple and painless to write. It uses a special set of C pre-processor directives called pragmas to annotate common programming constructs and have them transformed into threaded versions. It also automatically picks a level of parallelism that matches the number of cores available.
In this writeup we describe all the OpenMP functionality you should need. However, you may also find it useful to refer to (for example):
- Mattson and Meadows' OpenMP introduction (slide deck) from the Supercomputing '08 conference
- the official OpenMP reference guide
- Lawrence Livermore National Laboratory's OpenMP tutorial
2.1 Set-up
Your compiler has to be built with OpenMP support for OpenMP to work. On portal.cs.virginia.edu, the GCC compiler has been built that way but the Clang compiler has not, so you'll need to use gcc, not clang, to compile code for this lab if you use portal. Note this requires module load gcc.
To verify your set-up works, try the following test.c:
#include <stdio.h>
#include <omp.h>
int main() {
#pragma omp parallel
("I'm a thread");
puts("non-threaded");
puts}
Compile with gcc -fopenmp -O2 test.c and run as ./a.out; if it worked, you'll see multiple I'm a thread lines printed out.
If you compile without the -fopenmp flag it will run on just one thread and print I'm a thread just once.
Note that an OpenMP #pragma applies only to the subsequent statement. Thus in the above, puts("I'm a thread"); is threaded but puts("non-threaded"); is not. If you want several statements to be parallelized, you'd put them in braces to combine them into one block statement, e.g.
#pragma omp parallel
{
("I'm a thread");
puts("Also a thread");
puts}
("but not this one"); puts
3 The Map-Reduce pattern
Data races are a major limiting factor on parallel computation. Synchronization can remove races, but at the cost of reduced parallelism. Several programming patterns can avoid these problems; one of the most popular is the map-reduce paradigm.
Map-reduce works as follows:
Map: turn an array of input values into an array of output values. Ensure that each output value is independent of the others.
Reduce: combine an array of values into a single value.
In both cases, the array might be implicitly defined; for example, the array of integers from 0 to 10000 could be implicit in the existence of a for loop.
Many problems that take large amounts of data as input can be re-posed as a map, followed by a reduce. Both Map and Reduce can be efficiently parallelized. Parallelizing the map operation is relatively simple. Parallelizing the reduce operation efficiently is more complex, but efficient implementations of the reduction operation can be reused.
In your implementations, you will combine a way of parallelizing map with a way of parallelizing reduce.
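For instance, here is a minimal serial sketch of the pattern (the squaring-and-summing computation is purely illustrative and not part of the lab): the map step squares each input element, and the reduce step sums the squares.
#include <stdio.h>

int main() {
    int input[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int mapped[8];
    // Map: each output element depends only on its own input element
    for (int i = 0; i < 8; i += 1)
        mapped[i] = input[i] * input[i];
    // Reduce: combine all mapped values into a single value
    int total = 0;
    for (int i = 0; i < 8; i += 1)
        total += mapped[i];
    printf("sum of squares = %d\n", total); // prints 204
}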
3.1 Parallel Map
There are two main ways to parallelize Map: one that is easy but assumes every piece of work is of similar difficulty, and another that is a bit trickier but efficiently handles some tasks being much harder than others.
3.1.1 Even split
If your serial code looks like
for(int i=0; i<N; i+=1) {
// ...
}
then the parallel code
#pragma omp parallel
{
#pragma omp for
for(int i=0; i<N; i+=1) {
// ...
}
}
which can also be more succinctly written
#pragma omp parallel for
for(int i=0; i<N; i+=1) {
// ...
}
works by splitting up the 0–N range into a number of chunks equal to the number of threads. For example, if it uses 4 threads then they will get the following ranges of i:
- 0–N/4
- N/4–N/2
- N/2–3*N/4
- 3*N/4–N
To turn this strategy into a complete solution, you'll need to combine it with one of the reduce parallelization techniques.
OpenMP pragmas used:
- #pragma omp parallel means: run the next statement/block simultaneously in separate threads
- #pragma omp for means: assuming we're already running in parallel, run the next for loop with different bounds in every thread
- #pragma omp parallel for means: have each thread run the next statement (a for loop) with different bounds; it's a short-hand combining omp parallel and omp for
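To see the even split in action, the following small program (not part of the starter code; the printing is just illustrative) reports which thread handled each index. With the default static split, each thread should typically take one contiguous chunk of the range, as described above.
#include <stdio.h>
#include <omp.h>

int main() {
    int N = 16;
    // Each iteration reports which thread it ran on
    #pragma omp parallel for
    for (int i = 0; i < N; i += 1) {
        printf("i=%2d handled by thread %d\n", i, omp_get_thread_num());
    }
}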
3.1.2 Task queue
If your serial code looks like
for(int i=0; i<N; i+=1) {
// work on item i
}
then the parallel code
int j = 0;
#pragma omp parallel
while (1) {
int i;
#pragma omp atomic capture
i = j++;
if (i >= N) break;
// work on item i
}
has each thread atomically update a shared counter until it gets too big. This ensures that if some threads take a long time in the // work on item i part, the other threads can still progress. It has more overhead than the previous method, though, as atomic actions are slower than non-atomic actions.
(As mentioned below, this can also be done with the schedule option to the parallel for construct, but we would like you to try first with the lower-level tools to better understand what is going on.)
OpenMP pragmas used:
- #pragma omp parallel means: have each thread run the next statement (the while loop).
- #pragma omp atomic capture means: the next statement (i = j++) is a capture-type statement and needs to run atomically. There are other atomic statement types; see https://www.openmp.org/spec-html/5.0/openmpsu95.html.
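A runnable version of this sketch, with printing standing in for the real work, might look like the following; compile it with -fopenmp as before.
#include <stdio.h>
#include <omp.h>

int main() {
    int N = 20;
    int j = 0; // shared counter of the next item to claim
    #pragma omp parallel
    while (1) {
        int i;
        #pragma omp atomic capture
        i = j++;           // atomically claim the next item
        if (i >= N) break; // no items left
        printf("thread %d took item %d\n", omp_get_thread_num(), i);
    }
}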
3.1.3 (extra) Task queue with larger tasks
A variant of the task queue method that reduces the performance impact of the atomic operation is to have threads increment the shared counter by some value K (instead of 1) and do work for K items instead of 1:
int j = 0;
#pragma omp parallel
while (1) {
int i;
#pragma omp atomic capture
i = j += K;
if (i - K >= N) break;
// work on item i - K through item min(i, N)
}
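To make the index arithmetic concrete, here is a hedged runnable sketch (the value of K and the printing are just illustrative choices): each atomic capture claims a chunk of K items, and the inner loop stops at min(i, N) so the last chunk doesn't run past the end of the data.
#include <stdio.h>
#include <omp.h>
#define K 16

int main() {
    int N = 100;
    int j = 0; // shared counter; each capture advances it by K
    #pragma omp parallel
    while (1) {
        int i;
        #pragma omp atomic capture
        i = j += K;              // i is the counter value *after* adding K
        if (i - K >= N) break;   // this thread's chunk would start past the end
        int end = i < N ? i : N; // min(i, N)
        for (int idx = i - K; idx < end; idx += 1) {
            printf("thread %d works on item %d\n", omp_get_thread_num(), idx);
        }
    }
}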
3.1.4 (extra) OpenMP’s parallel for schedules
OpenMP's parallel for construct supports a schedule option: rather than dividing up the iterations statically into chunks that are as large as possible, it can use another strategy.
(For timing this, it's sufficient to choose one of the schedule options below, though you may find it interesting to examine multiple if you have time.)
The schedules that you can choose from include:
- a guided schedule, which is like task queues with larger tasks, except it varies the size of the task each thread gets automatically rather than having you choose a single value:
#pragma omp parallel for schedule(guided)
for(int i=0; i<N; i+=1) {
// ...
}
- a dynamic schedule, which acts like task queues with fixed-size tasks. Code like the following will do something similar to the Task queue strategy:
#pragma omp parallel for schedule(dynamic)
for(int i=0; i<N; i+=1) {
// ...
}
and code like the following will use something similar to the "Task queue with larger tasks" strategy:
#pragma omp parallel for schedule(dynamic, K)
for(int i=0; i<N; i+=1) {
// ...
}
where K is some integer constant giving the number of iterations to do as a task, which OpenMP's documentation calls the chunk size.
- the default static schedule, which can take a parameter of how many iterations to assign to each thread at a time (instead of trying to use chunks that are as large as possible). For example,
#pragma omp parallel for schedule(static, 10)
for(int i=0; i<N; i+=1) {
// ...
}
with two threads would:
- assign 0..9, 20..29, 40..49, etc. (up to N) to thread 1
- assign 10..19, 30..39, etc. (up to N) to thread 2
instead of assigning 0..(N/2 - 1) to one thread and (N/2)..(N-1) to the other.
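If you want to watch a schedule in action, a small sketch like the following (the loop body is purely illustrative) prints which thread ran each iteration; with schedule(static, 10) the iterations should be handed out in round-robin blocks of 10 per thread.
#include <stdio.h>
#include <omp.h>

int main() {
    int N = 40;
    // Iterations are assigned to threads in blocks of 10, round-robin
    #pragma omp parallel for schedule(static, 10)
    for (int i = 0; i < N; i += 1) {
        printf("i=%2d ran on thread %d\n", i, omp_get_thread_num());
    }
}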
3.2 Parallel Reduce
There are two main ways to parallelize Reduce: either use an atomic operation, or do partial reductions in each of several threads and then combine them all in one thread afterwards.
3.2.1 Atomic reduction – non-parallel
The simplest way to reduce is to make the reduction step a #pragma omp atomic of some type (usually update). This limits the performance value of parallelism, so it's not recommended in general, but in some cases it is adequate to achieve a needed speedup.
This is done by replacing
my_type result = zero_for_my_type;
for(int i=0; i<N; i+=1) {
result op= array[i];
}
(where op= is some kind of augmented assignment, like += or *=) with (assuming the mapping step uses the even split strategy):
my_type result = zero_for_my_type;
#pragma omp parallel for
for(int i=0; i<N; i+=1) {
#pragma omp atomic update
result op= array[i];
}
If the mapping step did not use the even split strategy, we would replace the for loop with another looping construct accordingly. For example, with the task queue map strategy mentioned above, we’d end up with code like:
my_type result = zero_for_my_type;
int j = 0;
#pragma omp parallel
while (1) {
int i;
#pragma omp atomic capture
i = j++;
if (i >= N) break;
#pragma omp atomic update
result op= array[i];
}
Note that since the bulk of the operation is atomic, it runs in a mostly serial fashion. However, the array lookup and loop indexing can be done in parallel, so it might still have some value. That value can be increased if it is merged with the mapping loop into a single parallel for loop, reducing threading overhead.
OpenMP pragmas used:
- #pragma omp parallel for means: have each thread run the next statement (a for loop) with different bounds.
- #pragma omp atomic update means: the next statement (result op= array[i]) is an update-type statement and needs to run atomically. There are other atomic statement types; see https://www.openmp.org/spec-html/5.0/openmpsu95.html.
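As a concrete (illustrative) instance of this pattern where op= is +=, the following sums an array of doubles; note that every iteration pays for an atomic operation.
#include <stdio.h>

int main() {
    int N = 1000;
    double array[1000];
    for (int i = 0; i < N; i += 1) array[i] = 0.5;

    double result = 0; // zero_for_my_type when op= is +=
    #pragma omp parallel for
    for (int i = 0; i < N; i += 1) {
        #pragma omp atomic update
        result += array[i]; // the atomic update serializes most of the work
    }
    printf("%g\n", result); // expect 500
}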
3.2.2 Many-to-few reduction – atomic version
An alternative approach is to reduce an array of N values into an array of n values, where n is the number of threads; then have one thread further reduce the n values to 1.
Given reduction code
my_type result = zero_for_my_type;
for(int i=0; i<N; i+=1) {
result op= array[i];
}
(where op= is some kind of augmented assignment, like += or *=), then the parallel code (assuming this reduction strategy is combined with an even split mapping strategy)
my_type result = zero_for_my_type;
#pragma omp parallel
{
my_type local_result = zero_for_my_type;
#pragma omp for nowait
for(int i=0; i<N; i+=1) {
local_result op= array[i];
}
#pragma omp atomic update
result op= local_result;
}
will have each thread do its share of reductions on its own local copy of the result, then atomically update the shared total.
(If this was combined with a different mapping strategy, then we would replace the for loop above with a different loop structure.)
OpenMP pragmas used:
- #pragma omp parallel means: have each thread run the next statement (the block in { ... }).
- #pragma omp for nowait means: the next statement (a for loop) should have its bounds adjusted depending on which thread is running it. It's like the #pragma omp parallel for discussed under the Even split section above, but instead of creating new threads it uses those already existing. The nowait means if one thread finishes before another, it can move on to post-for-loop code without waiting for the others to finish.
- #pragma omp atomic update means: the next statement (result op= local_result) is an update-type statement and needs to run atomically. There are other atomic statement types; see https://www.openmp.org/spec-html/5.0/openmpsu95.html.
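Here is the same illustrative double-summing example written with this many-to-few strategy; each thread accumulates into a private variable and only performs one atomic update at the end.
#include <stdio.h>

int main() {
    int N = 1000;
    double array[1000];
    for (int i = 0; i < N; i += 1) array[i] = 0.5;

    double result = 0;
    #pragma omp parallel
    {
        double local_result = 0; // each thread's private partial sum
        #pragma omp for nowait
        for (int i = 0; i < N; i += 1) {
            local_result += array[i]; // no atomic needed on the private copy
        }
        #pragma omp atomic update
        result += local_result; // one atomic per thread, not per element
    }
    printf("%g\n", result); // expect 500
}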
3.2.3 (not required) Many-to-few reduction – array version
If your reduction step is more than a single update operation, a more complicated solution is needed. See the Appendix for more.
4 Starting code
#include <stdio.h> // fopen, fread, fclose, printf, fseek, ftell
#include <math.h> // log, exp
#include <stdlib.h> // free, realloc
#include <time.h> // struct timespec, clock_gettime, CLOCK_REALTIME
#include <errno.h>
// computes the geometric mean of a set of values.
// You should use OpenMP to make faster versions of this.
// Keep the underlying sum-of-logs approach.
double geomean(unsigned char *s, size_t n) {
double answer = 0;
for(int i=0; i<n; i+=1) {
if (s[i] > 0) answer += log(s[i]) / n;
}
return exp(answer);
}
/// nanoseconds that have elapsed since 1970-01-01 00:00:00 UTC
long long nsecs() {
struct timespec t;
clock_gettime(CLOCK_REALTIME, &t);
return t.tv_sec*1000000000 + t.tv_nsec;
}
/// reads arguments and invokes geomean; should not require editing
int main(int argc, char *argv[]) {
// step 1: get the input array (the bytes in this file)
char *s = NULL;
size_t n = 0;
for(int i=1; i<argc; i+=1) {
// add argument i's file contents (or string value) to s
FILE *f = fopen(argv[i], "rb");
if (f) { // was a file; read it
fseek(f, 0, SEEK_END); // go to end of file
size_t size = ftell(f); // find out how many bytes in that was
fseek(f, 0, SEEK_SET); // go back to beginning
s = realloc(s, n+size); // make room
fread(s+n, 1, size, f); // append this file on end of others
fclose(f);
n += size; // note new size
} else { // not a file; treat as a string
errno = 0; // clear the read error
}
}
// step 2: invoke and time the geometric mean function
long long t0 = nsecs();
double answer = geomean((unsigned char*) s,n);
long long t1 = nsecs();
free(s);

// step 3: report result
printf("%lld ns to process %zd characters: %g\n", t1-t0, n, answer);
}
5 Appendix: Fancy Reduce
If the reduction operation is more complicated than a single atomic operation can support, we can store the threads' intermediate results in an array.
Given reduction code
my_type result = zero_for_my_type;
for(int i=0; i<N; i+=1) {
// arbitrary code to add array[i] to result
}
then the parallel code
// find out how many threads we are using:
int threads = 1;
#ifdef _OPENMP
#pragma omp parallel
#pragma omp master
threads = omp_get_num_threads();
#endif
my_type *results = (my_type *)malloc(threads * sizeof(my_type));

#pragma omp parallel
{
#ifdef _OPENMP
int myid = omp_get_thread_num();
#else
int myid = 0;
#endif
results[myid] = zero_for_my_type;
#pragma omp for nowait
for(int i=0; i<N; i+=1) {
// arbitrary code to add array[i] to results[myid]
}
}
my_type result = zero_for_my_type;
for(int i=0; i<threads; i+=1) {
// arbitrary code to add results[i] to result
}
will have each thread do its share of reductions on its own local copy of the result, then have one thread combine them all.
False sharing
The code above can have problems with false sharing. Although each thread accesses an independent results[i] value, those values will often be in the same cache block. With multiple cores, each core's cache will need to take turns holding that cache block, and so the cores will not be able to work in parallel as much as one might expect.
This problem can be avoided by making each results[i] take up a whole cache block, or otherwise spacing out the results array more.
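One hedged way to do that spacing (assuming 64-byte cache blocks and using double as a stand-in for my_type) is to pad each slot of the results array so it occupies a full cache block:
#include <stdlib.h>

// Each slot occupies its own 64-byte cache block, so two threads'
// results never share a block.
struct padded_result {
    double value;                       // stand-in for my_type
    char padding[64 - sizeof(double)];  // fill out the rest of the block
};

// allocate one padded slot per thread; the loop then uses results[myid].value
struct padded_result *make_results(int threads) {
    struct padded_result *results = malloc(threads * sizeof *results);
    for (int i = 0; i < threads; i += 1) results[i].value = 0;
    return results;
}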
OpenMP pragmas, macros, and functions used:
- #pragma omp master means: only one thread (called the master thread) gets to run this.
- #ifdef _OPENMP means: only include this code if -fopenmp was provided at compile time.
- omp_get_num_threads() returns the number of threads OpenMP has running.
- omp_get_thread_num() returns the index of this thread (between 0 and the number of threads minus 1).
- #pragma omp parallel means: run the next statement in multiple threads.
- #pragma omp for nowait means: the next statement (a for loop) should have its bounds adjusted depending on which thread is running it. It's like the #pragma omp parallel for discussed under the Even split section above, but instead of creating new threads it uses those already existing. The nowait means if one thread finishes before another, it can move on to post-for-loop code without waiting for the others to finish.