Here are my results from running STREAM on the configuration listed
below. For the most part the results are tightly grouped with little
variation. Results improved as I increased OMP_NUM_THREADS, peaking at
8. Incidentally, this is the number of cores in my system (2 x quad core).
However, on my dual-CPU AMD 2212HE based system the results are not
consistent at all. Using the same software and configuration as
listed below, my results range from 3500MB/s to 7700MB/s. Do
you have any advice or suggestions about what could be wrong?
--------------------------------------------------------------
Hardware:
- Dell PowerEdge 1950 III
- Two (2) Intel Xeon Quad Core E5440
- Memory stick configuration (from factory): 2G x 0 x 2G x 0 x 2G x 0 x
2G x 0 = 8G total memory
--------------------------------------------------------------
Operating System:
- Ubuntu 8.04 AMD64 Desktop
Compiled as follows:
gcc -O3 -fopenmp -D_OPENMP stream.c -o stream
Run as follows:
export OMP_NUM_THREADS=8
./stream
--------------------------------------------------------------
Results:
[cyoub@cyoub.edmonton.yottayotta.com]$ cat results
-------------------------------------------------------------
STREAM version $Revision: 5.8 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 8
-------------------------------------------------------------
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 3380 microseconds.
(= 3380 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 6341.4944 0.0051 0.0050 0.0051
Scale: 5992.3979 0.0054 0.0053 0.0054
Add: 5572.2832 0.0086 0.0086 0.0086
Triad: 5771.3161 0.0083 0.0083 0.0083
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
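As a cross-check on the run above, the header and the timing columns can be reproduced from the array size. This is only a sketch. It assumes the usual STREAM accounting: three arrays of 8-byte doubles, two arrays touched per element for Copy and Scale, and three for Add and Triad. It also assumes "MB" means 2^20 bytes in the memory total but 10^6 bytes in the rate column, which is what the printed numbers are consistent with. The helper names are illustrative and not taken from stream.c.

```python
# Sketch of the STREAM bookkeeping; helper names are illustrative only.
BYTES_PER_WORD = 8          # "8 bytes per DOUBLE PRECISION word"
N = 2_000_000               # "Array size = 2000000"

# Arrays touched per element by each kernel (reads plus writes).
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def total_memory_mib(n=N):
    """Memory for the three arrays a, b, c, in units of 2**20 bytes."""
    return 3 * BYTES_PER_WORD * n / 2**20


def implied_min_time(kernel, rate_mb_s, n=N):
    """Best time in seconds implied by a reported rate (MB = 10**6 bytes)."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_WORD * n
    return bytes_moved / (rate_mb_s * 1e6)


if __name__ == "__main__":
    print(f"Total memory required = {total_memory_mib():.1f} MB")
    # Rates taken from the first run above.
    for kernel, rate in [("Copy", 6341.4944), ("Scale", 5992.3979),
                         ("Add", 5572.2832), ("Triad", 5771.3161)]:
        print(f"{kernel:6s} {implied_min_time(kernel, rate):.4f} s")
```

Under these assumptions the implied best times round to the same four decimals as the "Min time" column of that run, and the memory total matches the 45.8 MB in the header.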
-------------------------------------------------------------
STREAM version $Revision: 5.8 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 8
-------------------------------------------------------------
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 3644 microseconds.
(= 3644 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 6282.1310 0.0051 0.0051 0.0051
Scale: 5982.2485 0.0054 0.0053 0.0056
Add: 5561.3544 0.0087 0.0086 0.0087
Triad: 5761.5715 0.0083 0.0083 0.0084
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
-------------------------------------------------------------
STREAM version $Revision: 5.8 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 8
-------------------------------------------------------------
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 5143 microseconds.
(= 5143 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 6268.3415 0.0054 0.0051 0.0071
Scale: 5839.3617 0.0056 0.0055 0.0060
Add: 5910.5922 0.0084 0.0081 0.0085
Triad: 5944.9754 0.0082 0.0081 0.0083
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
-------------------------------------------------------------
STREAM version $Revision: 5.8 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 8
-------------------------------------------------------------
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 2902 microseconds.
(= 2902 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 6566.7463 0.0051 0.0049 0.0053
Scale: 6043.3936 0.0054 0.0053 0.0056
Add: 5772.8055 0.0086 0.0083 0.0087
Triad: 5993.2898 0.0082 0.0080 0.0084
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
-------------------------------------------------------------
STREAM version $Revision: 5.8 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 8
-------------------------------------------------------------
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 5073 microseconds.
(= 5073 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 6210.0462 0.0058 0.0052 0.0074
Scale: 5911.6335 0.0061 0.0054 0.0078
Add: 5853.0276 0.0090 0.0082 0.0104
Triad: 5756.6292 0.0087 0.0083 0.0106
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
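To put a number on "tightly grouped", the rate column of the five runs above can be reduced to a per-kernel spread. This is a minimal sketch rather than part of STREAM: the regular expression assumes the result-line layout shown above, the sample text is copied from those runs, and the function names are my own.

```python
import re
from collections import defaultdict

# Matches result lines such as "Copy: 6341.4944 0.0051 0.0050 0.0051".
LINE = re.compile(r"^(Copy|Scale|Add|Triad):\s+([0-9.]+)", re.MULTILINE)

# Result lines copied from the five runs above.
SAMPLE = """\
Copy: 6341.4944 0.0051 0.0050 0.0051
Scale: 5992.3979 0.0054 0.0053 0.0054
Add: 5572.2832 0.0086 0.0086 0.0086
Triad: 5771.3161 0.0083 0.0083 0.0083
Copy: 6282.1310 0.0051 0.0051 0.0051
Scale: 5982.2485 0.0054 0.0053 0.0056
Add: 5561.3544 0.0087 0.0086 0.0087
Triad: 5761.5715 0.0083 0.0083 0.0084
Copy: 6268.3415 0.0054 0.0051 0.0071
Scale: 5839.3617 0.0056 0.0055 0.0060
Add: 5910.5922 0.0084 0.0081 0.0085
Triad: 5944.9754 0.0082 0.0081 0.0083
Copy: 6566.7463 0.0051 0.0049 0.0053
Scale: 6043.3936 0.0054 0.0053 0.0056
Add: 5772.8055 0.0086 0.0083 0.0087
Triad: 5993.2898 0.0082 0.0080 0.0084
Copy: 6210.0462 0.0058 0.0052 0.0074
Scale: 5911.6335 0.0061 0.0054 0.0078
Add: 5853.0276 0.0090 0.0082 0.0104
Triad: 5756.6292 0.0087 0.0083 0.0106
"""


def parse_rates(text):
    """Collect the 'Rate (MB/s)' column per kernel from STREAM output."""
    rates = defaultdict(list)
    for kernel, rate in LINE.findall(text):
        rates[kernel].append(float(rate))
    return dict(rates)


def spread_percent(values):
    """Gap between the best and worst run, relative to the best."""
    return 100.0 * (max(values) - min(values)) / max(values)


if __name__ == "__main__":
    for kernel, values in parse_rates(SAMPLE).items():
        print(f"{kernel:6s} min {min(values):10.4f}  max {max(values):10.4f}"
              f"  spread {spread_percent(values):4.1f}%")
```

By this measure every kernel on the Xeon system stays within about 6% across the five runs, whereas a range of 3500MB/s to 7700MB/s on the AMD system corresponds to a spread of roughly 55%.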
Received on Sat May 03 10:06:03 2008