John D. McCalpin Wrote:
>
> > What, if any, are your preferences for obtaining parallel results?
>
> I greatly prefer automatically parallelized results. If that
> is not possible, my second choice is manually parallelized
> results. Failing that, I accept aggregate results from
> the Parallel_jobs script, which is in the same directory as
> the rest of the STREAM code.
Greetings again,
For the new SPP1600, here are times for a manually parallelized
STREAM (source and makefile attached for your perusal). I am curious
what your policy is regarding clusters and/or message passing.
Clusters of workstations can usually scale without bound on highly
parallel codes.
Each result is from a single process (using lightweight threads of course)
and shared memory. If you have any questions, please don't hesitate
to send me email. I hope these results are suitable for your web page.
Thanks,
Isom Crawford
HP Convex Technology Center
Detailed output follows (long):
================================== 1 CPU ==================================
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 28 microseconds.
Each test below will take on the order of 3223797 microseconds.
(= 115135 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 1 threads.
Function      Rate (MB/s)     RMS time     Min time     Max time
Assignment:      123.1902       3.3539       3.2470       4.1907
Scaling   :      121.1729       3.4095       3.3011       4.2600
Summing   :      156.2537       3.8406       3.8399       3.8411
SAXPYing  :      156.1776       3.8426       3.8418       3.8432
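As a cross-check on how the figures in this output fit together, here is a minimal sketch that recomputes the memory footprint, the tick count, and the reported rate from N and the 1 CPU timings. The helper names (stream_rate_mbs, total_memory_mb, clock_ticks) are illustrative and not part of the posted source; the formulas follow the printf arguments in main.c, and the sketch assumes an 8-byte double, as the output reports.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers; names are illustrative, formulas follow main.c. */

/* MB/s as reported: 1.0E-06 * bytes moved / best (minimum) time.
   arrays_touched is 2 for Assignment and Scaling, 3 for Summing and SAXPYing. */
double stream_rate_mbs(int arrays_touched, long n, double min_time_s)
{
    assert(min_time_s > 0.0);
    return 1.0e-6 * (double) arrays_touched * (double) sizeof(double)
                  * (double) n / min_time_s;
}

/* Footprint of the three arrays, in units of 2^20 bytes. */
double total_memory_mb(long n)
{
    return 3.0 * (double) n * (double) sizeof(double) / 1048576.0;
}

/* Whole timer ticks that fit in one test of test_us microseconds. */
long clock_ticks(double test_us, int quantum_us)
{
    assert(quantum_us > 0);
    return (long) (test_us / quantum_us);
}
```

For example, stream_rate_mbs(2, 25000000, 3.2470) comes out within rounding of the 123.1902 MB/s shown for Assignment (the printed Min time is rounded to four decimals). Note that the rate uses 10^6 bytes per MB while the "Total memory required" line uses 2^20 bytes per MB.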
================================== 4 CPU ==================================
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 28 microseconds.
Each test below will take on the order of 2881791 microseconds.
(= 102921 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 4 threads.
Function      Rate (MB/s)     RMS time     Min time     Max time
Assignment:      360.9111       1.2396       1.1083       1.4904
Scaling   :      356.7705       1.3061       1.1212       1.8792
Summing   :      455.8100       1.5042       1.3163       2.0646
SAXPYing  :      456.0660       1.4388       1.3156       1.5524
================================== 8 CPU ==================================
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 28 microseconds.
Each test below will take on the order of 2941454 microseconds.
(= 105051 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 8 threads.
Function      Rate (MB/s)     RMS time     Min time     Max time
Assignment:      442.1536       0.9316       0.9047       1.1391
Scaling   :      437.3879       0.9670       0.9145       1.3468
Summing   :      536.7600       1.1191       1.1178       1.1233
SAXPYing  :      535.3591       1.1227       1.1207       1.1257
================================= 12 CPU ==================================
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 29 microseconds.
Each test below will take on the order of 2769588 microseconds.
(= 95503 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 12 threads.
Function      Rate (MB/s)     RMS time     Min time     Max time
Assignment:      665.9309       0.6387       0.6007       0.9087
Scaling   :      658.3354       0.6339       0.6076       0.8263
Summing   :      807.5023       0.7441       0.7430       0.7462
SAXPYing  :      803.7552       0.7479       0.7465       0.7518
================================= 16 CPU ==================================
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 28 microseconds.
Each test below will take on the order of 2776748 microseconds.
(= 99169 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 16 threads.
Function      Rate (MB/s)     RMS time     Min time     Max time
Assignment:      878.5454       0.5988       0.4553       1.3092
Scaling   :      862.2159       0.5140       0.4639       0.8341
Summing   :     1068.6442       0.5624       0.5615       0.5636
SAXPYing  :     1063.5205       0.5657       0.5642       0.5667
================================= 20 CPU ==================================
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 28 microseconds.
Each test below will take on the order of 2705785 microseconds.
(= 96635 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 20 threads.
Function      Rate (MB/s)     RMS time     Min time     Max time
Assignment:     1100.3250       0.4621       0.3635       0.9708
Scaling   :     1082.5879       0.4069       0.3695       0.6396
Summing   :     1334.8134       0.4499       0.4495       0.4503
SAXPYing  :     1322.2852       0.4543       0.4538       0.4550
================================= 24 CPU ==================================
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 28 microseconds.
Each test below will take on the order of 2668254 microseconds.
(= 95294 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 24 threads.
Function      Rate (MB/s)     RMS time     Min time     Max time
Assignment:     1316.0965       0.3999       0.3039       0.8740
Scaling   :     1283.5896       0.3509       0.3116       0.5947
Summing   :     1599.8805       0.3754       0.3750       0.3759
SAXPYing  :     1599.9832       0.3794       0.3750       0.3821
================================= 28 CPU ==================================
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 28 microseconds.
Each test below will take on the order of 2597999 microseconds.
(= 92785 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 28 threads.
Function      Rate (MB/s)     RMS time     Min time     Max time
Assignment:     1563.3857       0.3356       0.2559       0.7306
Scaling   :     1524.0363       0.2947       0.2625       0.4924
Summing   :     1886.1104       0.3188       0.3181       0.3197
SAXPYing  :     1860.2286       0.3228       0.3225       0.3230
================================= 32 CPU ==================================
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 28 microseconds.
Each test below will take on the order of 2634035 microseconds.
(= 94072 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 32 threads.
Function      Rate (MB/s)     RMS time     Min time     Max time
Assignment:     1777.6834       0.3055       0.2250       0.6900
Scaling   :     1721.8297       0.2649       0.2323       0.4614
Summing   :     2158.8011       0.2791       0.2779       0.2843
SAXPYing  :     2164.6038       0.2824       0.2772       0.2841
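To read the scaling across the nine runs, one can divide each reported rate by the 1 CPU rate. The sketch below does that for the Assignment rates copied from the tables; the function names are illustrative, and "efficiency" here simply means speedup divided by CPU count, which is my framing rather than anything STREAM itself reports.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helpers; the names are illustrative only. */

/* Speedup of a parallel run over the 1 CPU run, from reported MB/s. */
double speedup(double rate_p, double rate_1)
{
    assert(rate_1 > 0.0);
    return rate_p / rate_1;
}

/* Parallel efficiency: speedup divided by the CPU count. */
double efficiency(double rate_p, double rate_1, int cpus)
{
    assert(cpus > 0);
    return speedup(rate_p, rate_1) / cpus;
}

/* Print speedup and efficiency for the Assignment rates posted above. */
void print_assignment_scaling(void)
{
    static const int cpus[] = { 1, 4, 8, 12, 16, 20, 24, 28, 32 };
    static const double rate[] = {
        123.1902, 360.9111, 442.1536, 665.9309, 878.5454,
        1100.3250, 1316.0965, 1563.3857, 1777.6834
    };
    int i;
    for (i = 0; i < 9; i++)
        printf("%2d CPU: speedup %6.2f, efficiency %5.1f%%\n",
               cpus[i], speedup(rate[i], rate[0]),
               100.0 * efficiency(rate[i], rate[0], cpus[i]));
}
```

With the posted numbers, the 32 CPU Assignment rate is roughly 14.4 times the 1 CPU rate, that is, about 45% efficiency by this measure.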
================================ SOURCE ==============================
------------------------------- streams.h ------------------------------
# define N 25000000
# define NTIMES 10
# define OFFSET 0
-------------------------------- main.c ------------------------------
# include <stdio.h>
# include <stdlib.h>
# include <math.h>
# include <float.h>
# include <limits.h>
# include <sys/time.h>
/* Include Convex parallel support library */
# include <cps.h>
# include <spp_prog_model.h>
barrier_t b1;
int total_threads, num_args, total_nodes;
spawn_sym_t cnx_sp= { CPS_ANY_NODE, 1, 1, CPS_THREAD_PARALLEL };
void kernel();
int retval;
#include "streams.h"
/*
* Program: Stream
* Programmer: Joe R. Zagar
* Revision: 4.0-BETA, October 24, 1995
* Original code developed by John D. McCalpin
*
* This program measures memory transfer rates in MB/s for simple
* computational kernels coded in C. These numbers reveal the quality
* of code generation for simple uncacheable kernels as well as showing
* the cost of floating-point operations relative to memory accesses.
*
* INSTRUCTIONS:
*
* 1) Stream requires a good bit of memory to run. Adjust the
* value of 'N' (in streams.h) to give a 'timing calibration' of
* at least 20 clock-ticks. This will provide rate estimates
* that should be good to about 5% precision.
*/
/*
* 3) Compile the code with full optimization. Many compilers
* generate unreasonably bad code before the optimizer tightens
* things up. If the results are unreasonably good, on the
* other hand, the optimizer might be too smart for me!
*
* Try compiling with:
* cc -O stream_d.c second.c -o stream_d -lm
*
* This is known to work on Cray, SGI, IBM, and Sun machines.
*
*
* 4) Mail the results to mccalpin@udel.edu
* Be sure to include:
* a) computer hardware model number and software revision
* b) the compiler flags
* c) all of the output from the test case.
* Thanks!
*
*/
# define HLINE "-------------------------------------------------------------\n"
# ifndef MIN
# define MIN(x,y) ((x)<(y)?(x):(y))
# endif
# ifndef MAX
# define MAX(x,y) ((x)>(y)?(x):(y))
# endif
static node_private double a[N+OFFSET],
                           b[N+OFFSET],
                           c[N+OFFSET];
static double rmstime[4] = {0}, maxtime[4] = {0},
              mintime[4] = {FLT_MAX,FLT_MAX,FLT_MAX,FLT_MAX};
static char *label[4] = {"Assignment:", "Scaling :",
                         "Summing :", "SAXPYing :"};
static double bytes[4] = {
    2 * sizeof(double) * N,
    2 * sizeof(double) * N,
    3 * sizeof(double) * N,
    3 * sizeof(double) * N
};
double second();
double times[4][NTIMES];
int
main(argc,argv)
int argc;
char **argv;
{
    int quantum, checktick();
    int BytesPerWord;
    register int j, k;
    double t;
    if( argc != 2 )
    {
        fprintf(stderr,"Usage: par_stream_d <#threads>\n");
        exit(-1);
    }
    total_threads= atoi( argv[1] );
    /* Reject a zero or negative count; kernel() divides N by it. */
    if( total_threads < 1 )
    {
        fprintf(stderr,"Thread count must be at least 1.\n");
        exit(-1);
    }
    /* --- SETUP --- determine precision and check timing --- */
    printf(HLINE);
    BytesPerWord = sizeof(double);
    printf("This system uses %d bytes per DOUBLE PRECISION word.\n",
        BytesPerWord);
    printf(HLINE);
    printf("Array size = %d, Offset = %d\n" , N, OFFSET);
    printf("Total memory required = %.1f MB.\n",
        (3 * N * BytesPerWord) / 1048576.0);
    printf("Each test is run %d times, but only\n", NTIMES);
    printf("the *best* time for each is used.\n");
    /* Initialize the arrays, once per node. */
    total_nodes= cps_complex_nodes();
# pragma _CNX loop_parallel(nodes,ivar=k)
    for (k=0; k<total_nodes; k++)
        for (j=0; j<N; j++) {
            a[j] = 1.0;
            b[j] = 2.0;
            c[j] = 0.0;
        }
    printf(HLINE);
    if ( (quantum = checktick()) >= 1)
        printf("Your clock granularity/precision appears to be "
            "%d microseconds.\n", quantum);
    else
        printf("Your clock granularity appears to be "
            "less than one microsecond.\n");
    /* Time one serial pass over a[] to calibrate the test length. */
    t = second();
    for (j = 0; j < N; j++)
        a[j] = 2.0E0 * a[j];
    t = 1.0E6 * (second() - t);
    printf("Each test below will take on the order"
        " of %d microseconds.\n", (int) t );
    /* MAX() avoids dividing by zero when quantum is below 1 microsecond. */
    printf(" (= %d clock ticks)\n", (int) (t/MAX(quantum,1)) );
    printf("Increase the size of the arrays if this shows that\n");
    printf("you are not getting at least 20 clock ticks per test.\n");
    printf(HLINE);
    printf("WARNING: The above is only a rough guideline.\n");
    printf("For best results, please be sure you know the\n");
    printf("precision of your system timer.\n");
    printf(HLINE);
    /* --- MAIN LOOP --- repeat test cases NTIMES times --- */
    alloc_barrier( &b1 );
    cnx_sp.node= CPS_ANY_NODE;
    cnx_sp.min= total_threads;
    cnx_sp.max= total_threads;
    cnx_sp.threadscope= CPS_THREAD_PARALLEL;
    num_args= 3;
    printf("Spawning %d threads.\n",total_threads);
    retval= cps_ppcalln( &cnx_sp, kernel, &num_args, a, b, c );
    if( retval < 0 ) { perror("cps_ppcalln"); exit(1); }
    /* --- SUMMARY --- */
    for (k=0; k<NTIMES; k++)
    {
        for (j=0; j<4; j++)
        {
            rmstime[j] = rmstime[j] + (times[j][k] * times[j][k]);
            mintime[j] = MIN(mintime[j], times[j][k]);
            maxtime[j] = MAX(maxtime[j], times[j][k]);
        }
    }
    printf("Function Rate (MB/s) RMS time Min time Max time\n");
    for (j=0; j<4; j++) {
        rmstime[j] = sqrt(rmstime[j]/(double)NTIMES);
        printf("%s%11.4f %11.4f %11.4f %11.4f\n", label[j],
            1.0E-06 * bytes[j]/mintime[j],
            rmstime[j],
            mintime[j],
            maxtime[j]);
    }
    return 0;
}
# define M 20
int
checktick()
{
    int i, minDelta, Delta;
    double t1, t2, timesfound[M];
    /* Collect a sequence of M unique time values from the system. */
    for (i = 0; i < M; i++) {
        t1 = second();
        while( ((t2=second()) - t1) < 1.0E-6 )
            ;
        timesfound[i] = t1 = t2;
    }
    /*
     * Determine the minimum difference between these M values.
     * This result will be our estimate (in microseconds) for the
     * clock granularity.
     */
    minDelta = 1000000;
    for (i = 1; i < M; i++) {
        Delta = (int)( 1.0E6 * (timesfound[i]-timesfound[i-1]));
        minDelta = MIN(minDelta, MAX(Delta,0));
    }
    return(minDelta);
}
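The second half of checktick() can be exercised on its own with synthetic timestamps, which makes the granularity estimate easy to check without a real timer. This is a minimal sketch under my own assumptions: min_delta_us is a hypothetical name, and unlike checktick() it rounds each difference to the nearest microsecond instead of truncating, so that floating-point noise in a difference such as 28 microseconds cannot turn it into 27.

```c
#include <assert.h>

/* Hypothetical helper, not part of the posted source.
   Returns the smallest gap, in whole microseconds, between consecutive
   timestamps t[0..n-1] given in seconds. Negative gaps count as zero.
   Rounds to nearest, whereas checktick() truncates. */
int min_delta_us(const double *t, int n)
{
    int i, delta, min_delta = 1000000;
    assert(n >= 2);
    for (i = 1; i < n; i++) {
        delta = (int) (1.0e6 * (t[i] - t[i-1]) + 0.5);
        if (delta < 0)
            delta = 0;
        if (delta < min_delta)
            min_delta = delta;
    }
    return min_delta;
}
```

Feeding it timestamps spaced 28, 29, and 28 microseconds apart yields 28, which matches the granularity most of the runs above report.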
------------------------------- kernel.c ------------------------------
# include <stdio.h>
# include <math.h>
# include <float.h>
# include <limits.h>
# include <sys/time.h>
# include <cps.h>
# include <spp_prog_model.h>
#include "streams.h"
extern double second();
extern double times[4][NTIMES];
extern barrier_t b1;
extern int total_threads;
void
kernel(a,b,c)
double *a, *b, *c;
{
    double scalar;
    int thread_id, chunksize, start_indx, stop_indx;
    register int j, k;
    /* Each thread works on one contiguous chunk of the arrays;
       the last thread also takes any remainder. */
    thread_id= cps_stid();
    chunksize= N / total_threads;
    start_indx= thread_id * chunksize;
    stop_indx= (thread_id+1) * chunksize;
    if( thread_id >= (total_threads - 1)) stop_indx= N;
    scalar = 3.0;
    /* --- MAIN LOOP --- repeat test cases NTIMES times --- */
    /* All threads meet at a barrier before thread 0 starts the clock
       and again before it stops the clock. */
    for (k=0; k<NTIMES; k++)
    {
        /* Assignment */
        wait_barrier( &b1, &total_threads );
        if( thread_id <= 0 ) times[0][k] = second();
        for (j=start_indx; j<stop_indx; j++)
            c[j] = a[j];
        wait_barrier( &b1, &total_threads );
        if( thread_id <= 0 ) times[0][k] = second() - times[0][k];
        /* Scaling */
        wait_barrier( &b1, &total_threads );
        if( thread_id <= 0 ) times[1][k] = second();
        for (j=start_indx; j<stop_indx; j++)
            b[j] = scalar*c[j];
        wait_barrier( &b1, &total_threads );
        if( thread_id <= 0 ) times[1][k] = second() - times[1][k];
        /* Summing */
        wait_barrier( &b1, &total_threads );
        if( thread_id <= 0 ) times[2][k] = second();
        for (j=start_indx; j<stop_indx; j++)
            c[j] = a[j]+b[j];
        wait_barrier( &b1, &total_threads );
        if( thread_id <= 0 ) times[2][k] = second() - times[2][k];
        /* SAXPYing */
        wait_barrier( &b1, &total_threads );
        if( thread_id <= 0 ) times[3][k] = second();
        for (j=start_indx; j<stop_indx; j++)
            a[j] = b[j]+scalar*c[j];
        wait_barrier( &b1, &total_threads );
        if( thread_id <= 0 ) times[3][k] = second() - times[3][k];
    }
    return ;
}
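The index arithmetic at the top of kernel() can be isolated to see how the array is divided when N is not a multiple of the thread count. The following is a minimal sketch; chunk_t and thread_chunk are hypothetical names introduced here, and thread_id stands in for the value that cps_stid() returns in the posted code.

```c
#include <assert.h>

/* Hypothetical helper mirroring the index arithmetic in kernel(). */
typedef struct {
    long start;   /* first index owned by the thread */
    long stop;    /* one past the last index owned by the thread */
} chunk_t;

chunk_t thread_chunk(long n, int total_threads, int thread_id)
{
    chunk_t c;
    long chunksize;
    assert(total_threads > 0);
    assert(thread_id >= 0 && thread_id < total_threads);
    chunksize = n / total_threads;
    c.start = thread_id * chunksize;
    c.stop = (thread_id + 1) * chunksize;
    /* The last thread picks up the remainder of the division. */
    if (thread_id >= total_threads - 1)
        c.stop = n;
    return c;
}
```

With N = 25000000, the 1, 4, 8, 16, 20, and 32 thread runs divide evenly. With 12 threads each chunk is 2083333 elements and the last thread takes 4 extra, so the imbalance in the posted runs is negligible.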
------------------------------- second.c ------------------------------
/* A Fortran-callable gettimeofday routine to give access
to the wall clock timer.
This subroutine may need to be modified slightly to get
it to link with Fortran on your computer.
*/
#include <sys/time.h>
/* int gettimeofday(struct timeval *tp, struct timezone *tzp); */
double second()
{
/* struct timeval { long tv_sec;
                    long tv_usec; };
   struct timezone { int tz_minuteswest;
                     int tz_dsttime; }; */
    struct timeval tp;
    struct timezone tzp;
    int i;
    i = gettimeofday(&tp,&tzp);
    return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
}
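On systems that provide the POSIX clock_gettime() call, a monotonic clock is a possible alternative to gettimeofday() for interval timing, since it is not affected when the system's wall-clock time is adjusted. The sketch below is not part of the posted source; second_monotonic is a hypothetical name, and the sketch assumes CLOCK_MONOTONIC is available on the target system.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Hypothetical alternative to second(): seconds from the POSIX
   monotonic clock, as a double. Assumes CLOCK_MONOTONIC is supported. */
double second_monotonic(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);   /* fails only if the clock is unsupported */
    (void) rc;         /* silence unused warning when NDEBUG is set */
    return (double) ts.tv_sec + (double) ts.tv_nsec * 1.0e-9;
}
```

It would be used the same way as second(): take one reading before a kernel, one after, and subtract. The returned value has no meaning as a calendar time, only as a difference between two readings.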
------------------------------- makefile ------------------------------
FC= /opt/fortran/bin/f77
CC= /opt/ansic/bin/cc
OPT= +O2
FFLAGS= $(OPT)
CFLAGS= $(OPT)
LDFLAGS= -Wl,-aarchive
LD= $(FC)
INCLUDE= -I.
all: stream_d par_stream_d
stream_d: stream_d.o second.o
	$(LD) stream_d.o second.o $(LDFLAGS) -o stream_d
par_stream_d: main.o kernel.o second.o
	$(LD) main.o kernel.o second.o $(LDFLAGS) -o par_stream_d
stream_d.o: stream_d.f
second.o: second.c
main.o: main.c streams.h
	/usr/convex/bin/cc -O3 -c main.c
kernel.o: kernel.c streams.h
	$(CC) $(OPT) +Onoparmsoverlap -c kernel.c $(INCLUDE)