John,
I modified the C version of the STREAM benchmark to use pthreads so
that I could run the same code on both Solaris and AIX, and I would like
to contribute the code back. I didn't have a Fortran compiler and wasn't
aware that the -qsmp option would let me thread the kernels with
compiler directives. However, my threading of the kernels follows the
behavior of the parallel directives exactly, and the results I got
running this benchmark on other systems already reported on the site
were similar to the published numbers.
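For readers who have not seen a hand-threaded STREAM kernel, here is a minimal sketch of the general idea, using the Triad kernel. It is not the attached code: the names triad_slice, triad_worker, triad_threaded and MAX_THREADS are hypothetical, and it assumes each thread gets one contiguous, statically assigned slice of the arrays, which is roughly how a statically scheduled parallel loop directive divides the work.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical names throughout; this is an illustration, not the
   attached benchmark code. */
enum { MAX_THREADS = 64 };

typedef struct {
    double *a;
    const double *b;
    const double *c;
    double scalar;
    size_t begin;   /* first index owned by this thread */
    size_t end;     /* one past the last index owned by this thread */
} triad_slice;

/* Thread body: Triad over this thread's slice only. */
static void *triad_worker(void *arg)
{
    triad_slice *s = (triad_slice *)arg;
    size_t j;

    for (j = s->begin; j < s->end; j++)
        s->a[j] = s->b[j] + s->scalar * s->c[j];
    return NULL;
}

/* Runs a[j] = b[j] + scalar * c[j] for j in [0, n) on nthreads threads.
   Returns 0 on success, -1 for an unsupported thread count, or the
   error code from pthread_create if a thread could not be started. */
static int triad_threaded(double *a, const double *b, const double *c,
                          double scalar, size_t n, int nthreads)
{
    pthread_t tid[MAX_THREADS];
    triad_slice slice[MAX_THREADS];
    size_t chunk, rem, next = 0;
    int t, started = 0, rc = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;

    chunk = n / (size_t)nthreads;   /* base slice length */
    rem = n % (size_t)nthreads;     /* first rem slices get one extra */

    for (t = 0; t < nthreads; t++) {
        slice[t].a = a;
        slice[t].b = b;
        slice[t].c = c;
        slice[t].scalar = scalar;
        slice[t].begin = next;
        next += chunk + ((size_t)t < rem ? 1 : 0);
        slice[t].end = next;
        rc = pthread_create(&tid[t], NULL, triad_worker, &slice[t]);
        if (rc != 0)
            break;
        started++;
    }

    /* Join only the threads that were actually created. */
    for (t = 0; t < started; t++)
        pthread_join(tid[t], NULL);
    return rc;
}
```

A timed benchmark would normally keep the threads alive across the 20 repetitions rather than create and join them inside the timed region; the sketch creates them per call only to stay short.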
I am attaching the run log for an IBM p680 with 96 GB of memory and 24
CPUs. I ran it in a stepped sequence, starting with one CPU, then two,
and then up to 24 in steps of two. I am also attaching the code and
Makefile for your consideration; if you would like to update the code
base with this version, I have no objections. Of course, it would
probably have to be named differently, since I didn't bother to keep the
non-threaded functionality in this code. My compiler, IBM C 3.6.4,
didn't support the pragmas, or at least I couldn't find any mention of
them, so I resorted to pthreads.
There is one little bug in the code that I have since fixed: the
calculation of the total memory required should use 3.0 as the
multiplier instead of 3. I forgot to make that change before I ran this
set of data, so the integer calculation overflows and the log stops
reporting the memory size correctly above 2.1 GB. I used 8000000
elements per CPU for this test because the S85 has a 16 MB L2 cache, so
the array size in each run is 8000000 times the number of CPUs.
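The memory-size bug can be illustrated with a short sketch. This is not the attached code: total_mb and wrapped_mb are hypothetical helper names, and the sketch assumes the original expression was evaluated in 32-bit signed integer arithmetic. The first function does the arithmetic in floating point, as the 3.0 multiplier does; the second emulates the 32-bit wraparound using unsigned arithmetic, which wraps in a well-defined way.

```c
#include <stdint.h>

/* Hypothetical helpers; not taken from the attached benchmark code. */

/* Size in MB of the three STREAM arrays, computed in floating point
   (the 3.0 multiplier), so the result cannot wrap for large n. */
static double total_mb(uint32_t n, uint32_t bytes_per_word)
{
    return 3.0 * (double)bytes_per_word * (double)n / 1048576.0;
}

/* The value a wrapped 32-bit signed product 3 * n * bytes_per_word
   would yield. Signed overflow is undefined in C, so the wrap is
   emulated: take the product modulo 2^32, then reinterpret values of
   2^31 and above as negative two's-complement numbers. */
static double wrapped_mb(uint32_t n, uint32_t bytes_per_word)
{
    uint32_t bytes = (uint32_t)(3u * n * bytes_per_word);
    double as_signed = (bytes >= 2147483648u)
                           ? (double)bytes - 4294967296.0
                           : (double)bytes;

    return as_signed / 1048576.0;
}
```

Under that assumption, wrapped_mb(96000000, 8) gives about -1898.7 and wrapped_mb(192000000, 8) gives about 298.5, which match the values in the log below, while total_mb gives about 2197.3 and 4394.5 for the same array sizes.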
Regards,
William R. Sullivan
CTO WHAM Engineering & Software
512-345-9925 xt 110
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 8000000, Offset = 0
Total memory required = 183.1 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 3 microseconds.
Each test below will take on the order of 262854 microseconds.
(= 87618 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function   Rate (MB/s)   RMS time   Min time   Max time
Copy:         312.9753     0.4211     0.4090     0.4222
Scale:        304.6545     0.4218     0.4201     0.4228
Add:          315.4009     0.6088     0.6087     0.6089
Triad:        313.6486     0.6122     0.6121     0.6125
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 16000000, Offset = 0
Total memory required = 366.2 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 3 microseconds.
Each test below will take on the order of 284336 microseconds.
(= 94778 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function   Rate (MB/s)   RMS time   Min time   Max time
Copy:         603.7067     0.4289     0.4240     0.4380
Scale:        603.2884     0.4293     0.4243     0.4394
Add:          625.0509     0.6177     0.6143     0.6252
Triad:        625.2045     0.6172     0.6142     0.6242
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 32000000, Offset = 0
Total memory required = 732.4 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 3 microseconds.
Each test below will take on the order of 278615 microseconds.
(= 92871 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function   Rate (MB/s)   RMS time   Min time   Max time
Copy:        1096.5431     0.4704     0.4669     0.4755
Scale:       1162.0388     0.4442     0.4406     0.4491
Add:         1163.5641     0.6656     0.6600     0.6725
Triad:       1165.4427     0.6640     0.6590     0.6702
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 48000000, Offset = 0
Total memory required = 1098.6 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 3 microseconds.
Each test below will take on the order of 295103 microseconds.
(= 98367 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function   Rate (MB/s)   RMS time   Min time   Max time
Copy:        1649.1800     0.4704     0.4657     0.4939
Scale:       1633.8547     0.4741     0.4701     0.4890
Add:         1722.6529     0.6747     0.6687     0.6973
Triad:       1733.4543     0.6707     0.6646     0.6937
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 64000000, Offset = 0
Total memory required = 1464.8 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 3 microseconds.
Each test below will take on the order of 296449 microseconds.
(= 98816 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function   Rate (MB/s)   RMS time   Min time   Max time
Copy:        2205.9456     0.4717     0.4642     0.4814
Scale:       2037.3124     0.5108     0.5026     0.5147
Add:         2101.2111     0.7397     0.7310     0.7436
Triad:       2112.6295     0.7356     0.7271     0.7401
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 80000000, Offset = 0
Total memory required = 1831.1 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 3 microseconds.
Each test below will take on the order of 296009 microseconds.
(= 98669 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function   Rate (MB/s)   RMS time   Min time   Max time
Copy:        2675.5400     0.4791     0.4784     0.4865
Scale:       2645.1091     0.4842     0.4839     0.4845
Add:         2776.5365     0.6917     0.6915     0.6920
Triad:       2794.9188     0.6873     0.6870     0.6875
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 96000000, Offset = 0
Total memory required = -1898.7 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 3 microseconds.
Each test below will take on the order of 315271 microseconds.
(= 105090 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function   Rate (MB/s)   RMS time   Min time   Max time
Copy:        2815.2951     0.5467     0.5456     0.5588
Scale:       2910.5629     0.5282     0.5277     0.5286
Add:         2988.3461     0.7717     0.7710     0.7723
Triad:       3004.2482     0.7677     0.7669     0.7684
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 112000000, Offset = 0
Total memory required = -1532.5 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 3 microseconds.
Each test below will take on the order of 323757 microseconds.
(= 107919 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function   Rate (MB/s)   RMS time   Min time   Max time
Copy:        3484.6595     0.5256     0.5143     0.5481
Scale:       3439.0371     0.5313     0.5211     0.5419
Add:         3648.4460     0.7493     0.7368     0.7615
Triad:       3669.3740     0.7451     0.7326     0.7574
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 128000000, Offset = 0
Total memory required = -1166.3 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 3 microseconds.
Each test below will take on the order of 327443 microseconds.
(= 109147 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function   Rate (MB/s)   RMS time   Min time   Max time
Copy:        3688.6017     0.5700     0.5552     0.5878
Scale:       3773.1053     0.5574     0.5428     0.5609
Add:         3933.9626     0.7982     0.7809     0.8017
Triad:       3954.1159     0.7942     0.7769     0.7978
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 144000000, Offset = 0
Total memory required = -800.1 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 3 microseconds.
Each test below will take on the order of 342617 microseconds.
(= 114205 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function   Rate (MB/s)   RMS time   Min time   Max time
Copy:        4185.0502     0.5583     0.5505     0.5882
Scale:       4121.6975     0.5655     0.5590     0.5743
Add:         4448.7128     0.7845     0.7769     0.7974
Triad:       4482.8741     0.7785     0.7709     0.7918
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 160000000, Offset = 0
Total memory required = -433.9 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 3 microseconds.
Each test below will take on the order of 366235 microseconds.
(= 122078 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function   Rate (MB/s)   RMS time   Min time   Max time
Copy:        4229.1607     0.6070     0.6053     0.6298
Scale:       4297.3644     0.5968     0.5957     0.5975
Add:         4581.6340     0.8399     0.8381     0.8409
Triad:       4613.3493     0.8328     0.8324     0.8332
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 176000000, Offset = 0
Total memory required = -67.7 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 3 microseconds.
Each test below will take on the order of 383044 microseconds.
(= 127681 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function   Rate (MB/s)   RMS time   Min time   Max time
Copy:        4596.9430     0.6150     0.6126     0.6453
Scale:       4515.8818     0.6243     0.6236     0.6252
Add:         4951.7604     0.8544     0.8530     0.8562
Triad:       5004.1825     0.8456     0.8441     0.8467
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 192000000, Offset = 0
Total memory required = 298.5 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 3 microseconds.
Each test below will take on the order of 799821 microseconds.
(= 266607 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function   Rate (MB/s)   RMS time   Min time   Max time
Copy:        4726.2996     0.6584     0.6500     0.6878
Scale:       4556.9921     0.6797     0.6741     0.6800
Add:         5008.7225     0.9277     0.9200     0.9300
Triad:       5036.2141     0.9194     0.9150     0.9200