Hi John - I hope you're well. Here are STREAM results from a dual
Athlon machine at my old Brown haunts:
Dual AthlonMP 1.2GHz with 896MB of visible DDR memory running Linux 2.4.9 SMP:
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 6
model name : AMD Athlon(tm) Processor
stepping : 1
cpu MHz : 1194.693
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips : 2385.51
PGI 3.2.4 compilers
Serial Fortran 5.0 code using etime timer
pgf77 -fast -tp athlon -Minline -Mvect=assoc,prefetch,cachesize:327680 -Mcache_align stream_d.f second_cpu.f
With really large arrays:
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
Array size = 24000000
Offset = 0
The total memory requirement is 549 MB
You are running each test 10 times
--
The *best* time for each test is used
*EXCLUDING* the first and last iterations
----------------------------------------------------
Your clock granularity/precision appears to be 10000 microseconds
----------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:            619.3549     0.6237     0.6200     0.6300
Scale:           698.1818     0.5512     0.5500     0.5600
Add:             757.8948     0.7687     0.7600     0.7800
Triad:           757.8948     0.7687     0.7600     0.7700
----------------------------------------------------
Solution Validates!
----------------------------------------------------
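For reference, the rates above can be recomputed from the array size and the minimum time. This is a minimal sketch, assuming the usual STREAM accounting (Copy and Scale move two 8-byte words per element, Add and Triad move three, and 1 MB = 10^6 bytes) and assuming the memory figure counts three arrays in units of 2^20 bytes. The function names are mine, not part of stream_d.f.

```python
# Sketch: recompute STREAM rates and memory use from the reported numbers.
# Assumptions (not taken from stream_d.f itself): Copy/Scale move 2 words
# per element, Add/Triad move 3, words are 8 bytes, and 1 MB = 1e6 bytes.
WORDS_PER_ELEMENT = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def stream_rate_mb_s(kernel, n, min_time_s, bytes_per_word=8):
    """Return the bandwidth in MB/s for one kernel over n elements."""
    bytes_moved = WORDS_PER_ELEMENT[kernel] * bytes_per_word * n
    return bytes_moved / min_time_s / 1.0e6


def total_memory_mib(n, arrays=3, bytes_per_word=8):
    """Return the footprint of the three arrays in units of 2**20 bytes."""
    return arrays * bytes_per_word * n / 2**20


# Minimum times from the 24,000,000-element run.
min_times = {"Copy": 0.62, "Scale": 0.55, "Add": 0.76, "Triad": 0.76}
for kernel, t in min_times.items():
    print(f"{kernel:6s} {stream_rate_mb_s(kernel, 24_000_000, t):10.4f} MB/s")
print(f"memory: {total_memory_mib(24_000_000):.1f} (reported as 549 MB)")
```

The recomputed rates agree with the table to within rounding in the last digit.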
With smaller arrays (less trouble for the TLB maybe?):
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
Array size = 10000000
Offset = 0
The total memory requirement is 228 MB
You are running each test 10 times
--
The *best* time for each test is used
*EXCLUDING* the first and last iterations
----------------------------------------------------
Your clock granularity/precision appears to be 10000 microseconds
----------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:            727.2727     0.2212     0.2200     0.2300
Scale:           695.6522     0.2325     0.2300     0.2500
Add:             827.5862     0.2975     0.2900     0.3000
Triad:           827.5862     0.2950     0.2900     0.3000
----------------------------------------------------
Solution Validates!
----------------------------------------------------
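A caveat on the etime numbers: with a 10000 microsecond clock and minimum times of 0.2 to 0.3 s, one tick shifts the reported rate by several percent, which may be why Add and Triad come out identical. The sketch below bounds a rate under the assumption that the minimum time can be off by up to one tick in either direction. The helper name is hypothetical.

```python
# Sketch: bound a STREAM rate measured with a coarse timer.
# Assumption: the reported minimum time may be off by up to one clock
# tick either way (10000 microseconds = 0.01 s for the etime runs).
def rate_bounds_mb_s(bytes_moved, min_time_s, tick_s=0.01):
    """Return (low, nominal, high) bandwidth in MB/s, with 1 MB = 1e6 bytes."""
    nominal = bytes_moved / min_time_s / 1.0e6
    low = bytes_moved / (min_time_s + tick_s) / 1.0e6
    high = bytes_moved / (min_time_s - tick_s) / 1.0e6
    return low, nominal, high


n = 10_000_000
# Copy moves 2 words of 8 bytes per element; reported minimum time is 0.22 s.
low, nominal, high = rate_bounds_mb_s(2 * 8 * n, 0.22)
print(f"Copy: {low:.1f} <= {nominal:.1f} <= {high:.1f} MB/s")
```

Under this assumption the serial Copy and Scale rates at this array size are not clearly distinguishable, since Scale's 695.65 MB/s is exactly what Copy would report with one more tick.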
Running 2 copies simultaneously:
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
Array size = 10000000
Offset = 0
The total memory requirement is 228 MB
You are running each test 10 times
--
The *best* time for each test is used
*EXCLUDING* the first and last iterations
----------------------------------------------------
Your clock granularity/precision appears to be 10000 microseconds
----------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:            533.3333     0.3075     0.3000     0.3200
Scale:           516.1290     0.3287     0.3100     0.3400
Add:             585.3659     0.4225     0.4100     0.4300
Triad:           600.0000     0.4162     0.4000     0.4300
----------------------------------------------------
Solution Validates!
----------------------------------------------------
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
Array size = 10000000
Offset = 0
The total memory requirement is 228 MB
You are running each test 10 times
--
The *best* time for each test is used
*EXCLUDING* the first and last iterations
----------------------------------------------------
Your clock granularity/precision appears to be 10000 microseconds
----------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:            533.3333     0.3137     0.3000     0.3200
Scale:           516.1290     0.3312     0.3100     0.3400
Add:             585.3659     0.4262     0.4100     0.4400
Triad:           585.3659     0.4200     0.4100     0.4300
----------------------------------------------------
Solution Validates!
----------------------------------------------------
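To compare the two simultaneous copies with the single-copy run, the per-process rates can be summed. This is only a rough sketch: it assumes the two processes overlapped for the whole timed region, which the output above does not prove.

```python
# Sketch: aggregate bandwidth of two concurrent STREAM processes versus one.
# Rates (MB/s) are copied from the 10,000,000-element serial tables.
# Assumption: both processes ran concurrently for the whole timed region.
single = {"Copy": 727.2727, "Scale": 695.6522, "Add": 827.5862, "Triad": 827.5862}
copy_a = {"Copy": 533.3333, "Scale": 516.1290, "Add": 585.3659, "Triad": 600.0000}
copy_b = {"Copy": 533.3333, "Scale": 516.1290, "Add": 585.3659, "Triad": 585.3659}


def aggregate(a, b):
    """Sum the per-kernel rates of two concurrent runs."""
    return {kernel: a[kernel] + b[kernel] for kernel in a}


total = aggregate(copy_a, copy_b)
for kernel in single:
    ratio = total[kernel] / single[kernel]
    print(f"{kernel:6s} {total[kernel]:10.4f} MB/s  ({ratio:.2f}x one copy)")
```

On these numbers the pair together moves roughly 1.4 to 1.5 times what a single process does, well short of 2x.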
OpenMP parallel Fortran 5.0 code using gettimeofday timer
pgcc -omp -fast -tp athlon -Minline -Mvect=assoc,prefetch,cachesize:327680 -Mcache_align -c second_wall_f.c
pgf77 -mp -fast -tp athlon -Minline -Mvect=assoc,prefetch,cachesize:327680 -Mcache_align stream_d.f second_wall_f.o
1 processor run
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
Array size = 10000000
Offset = 0
The total memory requirement is 228 MB
You are running each test 10 times
--
The *best* time for each test is used
*EXCLUDING* the first and last iterations
----------------------------------------------------
Your clock granularity appears to be less than one microsecond
Your clock granularity/precision appears to be 1 microseconds
----------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:            726.7869     0.2203     0.2201     0.2205
Scale:           711.7692     0.2249     0.2248     0.2251
Add:             860.0672     0.2791     0.2790     0.2793
Triad:           851.4927     0.2820     0.2819     0.2821
----------------------------------------------------
Solution Validates!
----------------------------------------------------
10.710u 0.530s 0:11.24 100.0% 0+0k 0+0io 159pf+0w
2 processor run
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
Array size = 10000000
Offset = 0
The total memory requirement is 228 MB
You are running each test 10 times
--
The *best* time for each test is used
*EXCLUDING* the first and last iterations
----------------------------------------------------
Your clock granularity appears to be less than one microsecond
Your clock granularity/precision appears to be 1 microseconds
----------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:            922.0414     0.1742     0.1735     0.1752
Scale:           916.4276     0.1748     0.1746     0.1749
Add:            1051.7321     0.2285     0.2282     0.2293
Triad:          1053.3937     0.2283     0.2278     0.2299
----------------------------------------------------
Solution Validates!
----------------------------------------------------
16.880u 0.710s 0:08.99 195.6% 0+0k 0+0io 161pf+0w
Serial Fortran (Grassl stream_offset.f) code using inline etime timer
pgf77 -fast -tp athlon -Minline -Mvect=assoc,prefetch,cachesize:327680 -Mcache_align stream_offset.f
With really large arrays (best rates in MB/s: 677.993 634.261 797.109 776.132):
FORTRAN STOP
STREAM benchmark
----------------
Number of CPUs: 1
Array size: 24001 Kwords
Array padding: 20248 Words
Aoff Boff Coff Assignment:Scaling: Summing: SAXPYing:
-------------------------------------------------------------
8 8 8 634.246 634.246 737.311 719.328
16 8 8 634.246 634.246 737.311 719.328
24 8 8 655.387 634.246 737.312 719.328
32 8 8 655.387 634.246 737.312 719.328
8 16 8 655.387 634.246 797.094 776.116
16 16 8 655.387 634.246 797.094 776.116
24 16 8 655.388 634.247 797.094 776.117
32 16 8 655.388 634.247 797.094 776.117
8 24 8 655.388 634.247 797.094 776.117
16 24 8 655.388 634.247 797.094 776.117
24 24 8 655.388 634.247 797.094 776.117
32 24 8 655.388 634.247 797.094 776.117
8 32 8 655.388 634.247 797.096 776.120
16 32 8 655.388 634.247 797.096 776.120
24 32 8 655.388 634.247 797.096 776.120
32 32 8 655.388 634.247 797.096 776.120
8 8 16 655.388 634.247 797.096 776.120
16 8 16 655.388 634.248 797.096 776.120
24 8 16 655.388 634.248 797.096 776.120
32 8 16 655.389 634.248 797.096 776.120
8 16 16 655.389 634.248 797.096 776.120
16 16 16 655.389 634.248 797.096 776.120
24 16 16 655.389 634.248 797.096 776.120
32 16 16 655.389 634.253 797.096 776.120
8 24 16 655.389 634.253 797.096 776.120
16 24 16 655.389 634.253 797.100 776.124
24 24 16 655.389 634.253 797.100 776.124
32 24 16 655.389 634.253 797.100 776.124
8 32 16 655.389 634.253 797.100 776.124
16 32 16 655.389 634.253 797.100 776.124
24 32 16 655.389 634.253 797.100 776.124
32 32 16 655.389 634.253 797.100 776.124
8 8 24 655.389 634.253 797.100 776.124
16 8 24 655.389 634.253 797.100 776.124
24 8 24 655.389 634.253 797.100 776.124
32 8 24 655.389 634.253 797.100 776.124
8 16 24 655.393 634.253 797.100 776.124
16 16 24 655.393 634.253 797.100 776.124
24 16 24 677.993 634.253 797.100 776.124
32 16 24 677.993 634.253 797.100 776.124
8 24 24 677.993 634.253 797.100 776.124
16 24 24 677.993 634.253 797.100 776.124
24 24 24 677.993 634.253 797.100 776.124
32 24 24 677.993 634.253 797.100 776.124
8 32 24 677.993 634.253 797.104 776.124
16 32 24 677.993 634.253 797.104 776.124
24 32 24 677.993 634.253 797.104 776.124
32 32 24 677.993 634.253 797.104 776.124
8 8 32 677.993 634.261 797.104 776.124
16 8 32 677.993 634.261 797.109 776.132
24 8 32 677.993 634.261 797.109 776.132
32 8 32 677.993 634.261 797.109 776.132
8 16 32 677.993 634.261 797.109 776.132
16 16 32 677.993 634.261 797.109 776.132
24 16 32 677.993 634.261 797.109 776.132
32 16 32 677.993 634.261 797.109 776.132
8 24 32 677.993 634.261 797.109 776.132
16 24 32 677.993 634.261 797.109 776.132
24 24 32 677.993 634.261 797.109 776.132
32 24 32 677.993 634.261 797.109 776.132
8 32 32 677.993 634.261 797.109 776.132
16 32 32 677.993 634.261 797.109 776.132
24 32 32 677.993 634.261 797.109 776.132
32 32 32 677.993 634.261 797.109 776.132
FORTRAN STOP
184.270u 0.800s 3:05.13 99.9% 0+0k 0+0io 139pf+0w
With smaller ones (best rates in MB/s: 682.753 630.235 819.297 819.278):
FORTRAN STOP
STREAM benchmark
----------------
Number of CPUs: 1
Array size: 10001 Kwords
Array padding: 20248 Words
Aoff Boff Coff Assignment:Scaling: Summing: SAXPYing:
-------------------------------------------------------------
8 8 8 630.217 630.217 722.896 722.896
16 8 8 630.217 630.217 722.896 722.896
24 8 8 682.735 630.217 722.896 722.896
32 8 8 682.735 630.217 722.896 722.896
8 16 8 682.735 630.217 819.281 768.077
16 16 8 682.735 630.217 819.281 768.077
24 16 8 682.735 630.218 819.283 768.078
32 16 8 682.735 630.218 819.283 768.078
8 24 8 682.735 630.219 819.283 768.078
16 24 8 682.735 630.219 819.283 768.078
24 24 8 682.735 630.219 819.283 768.078
32 24 8 682.735 630.219 819.283 768.078
8 32 8 682.735 630.219 819.283 768.078
16 32 8 682.735 630.219 819.283 768.078
24 32 8 682.735 630.219 819.283 768.081
32 32 8 682.738 630.219 819.283 768.081
8 8 16 682.738 630.219 819.283 768.081
16 8 16 682.738 630.219 819.283 819.278
24 8 16 682.738 630.219 819.283 819.278
32 8 16 682.738 630.219 819.285 819.278
8 16 16 682.738 630.219 819.285 819.278
16 16 16 682.738 630.221 819.285 819.278
24 16 16 682.738 630.221 819.285 819.278
32 16 16 682.738 630.221 819.285 819.278
8 24 16 682.738 630.221 819.285 819.278
16 24 16 682.738 630.221 819.285 819.278
24 24 16 682.738 630.223 819.285 819.278
32 24 16 682.738 630.226 819.285 819.278
8 32 16 682.738 630.226 819.285 819.278
16 32 16 682.738 630.226 819.285 819.278
24 32 16 682.738 630.226 819.285 819.278
32 32 16 682.738 630.226 819.285 819.278
8 8 24 682.738 630.226 819.285 819.278
16 8 24 682.738 630.226 819.285 819.278
24 8 24 682.743 630.226 819.285 819.278
32 8 24 682.743 630.226 819.285 819.278
8 16 24 682.743 630.226 819.285 819.278
16 16 24 682.743 630.226 819.285 819.278
24 16 24 682.743 630.226 819.285 819.278
32 16 24 682.743 630.226 819.285 819.278
8 24 24 682.743 630.226 819.285 819.278
16 24 24 682.743 630.226 819.285 819.278
24 24 24 682.743 630.226 819.285 819.278
32 24 24 682.743 630.226 819.285 819.278
8 32 24 682.743 630.226 819.291 819.278
16 32 24 682.743 630.226 819.291 819.278
24 32 24 682.743 630.226 819.291 819.278
32 32 24 682.743 630.226 819.291 819.278
8 8 32 682.743 630.226 819.291 819.278
16 8 32 682.743 630.226 819.291 819.278
24 8 32 682.743 630.226 819.291 819.278
32 8 32 682.743 630.226 819.291 819.278
8 16 32 682.743 630.226 819.291 819.278
16 16 32 682.743 630.228 819.291 819.278
24 16 32 682.743 630.228 819.291 819.278
32 16 32 682.743 630.228 819.291 819.278
8 24 32 682.743 630.228 819.291 819.278
16 24 32 682.743 630.228 819.291 819.278
24 24 32 682.743 630.232 819.291 819.278
32 24 32 682.743 630.235 819.297 819.278
8 32 32 682.753 630.235 819.297 819.278
16 32 32 682.753 630.235 819.297 819.278
24 32 32 682.753 630.235 819.297 819.278
32 32 32 682.753 630.235 819.297 819.278
76.890u 0.260s 1:17.15 100.0% 0+0k 0+0io 141pf+0w
I did not spend time testing the stream_d.c and stream_l.cpp
versions of the benchmark, or the GNU compilers, since in my experience
they are slower on STREAM.
I cannot get any closer to the numbers that you have for
single-processor Athlons at 1.2GHz (same chipset, AMD 760), even though
those were obtained from a Tbird Athlon without the prefetch engine that
the Athlon4/MP/XP has. Either the Compaq Visual Fortran compiler produces
better (prefetching?) code or the memory configuration is different,
despite the fact that in both cases it is DDR.
At least going from 1 to 2 procs the achievable bandwidth increases
somewhat: in the OpenMP runs Copy goes from about 727 to 922 MB/s and
Triad from about 851 to 1053 MB/s.
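To put a number on the 1 to 2 processor scaling, the ratios can be taken directly from the two OpenMP tables above. A minimal sketch, with the rates copied by hand and a helper name of my own:

```python
# Sketch: 1 -> 2 processor scaling of the OpenMP STREAM runs.
# Rates (MB/s) are copied from the two OpenMP tables in this message.
one_proc = {"Copy": 726.7869, "Scale": 711.7692, "Add": 860.0672, "Triad": 851.4927}
two_proc = {"Copy": 922.0414, "Scale": 916.4276, "Add": 1051.7321, "Triad": 1053.3937}


def speedup(before, after):
    """Return the per-kernel ratio after/before."""
    return {kernel: after[kernel] / before[kernel] for kernel in before}


ratios = speedup(one_proc, two_proc)
for kernel, ratio in ratios.items():
    print(f"{kernel:6s} {ratio:.2f}x")
```

All four kernels land between roughly 1.2x and 1.3x, so the second processor adds about a quarter more bandwidth rather than doubling it.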
Dr. Constantinos Evangelinos
Ocean Engineering
MIT