I hope you want these results sent directly to you -- I couldn't find
any indication on the STREAM web page of where to send them!
Here are some STREAM results from three new machines we picked up where
I work. I don't have any fancy compilers, so I'm sure these are somewhat
lower than the numbers one could hit by using fancy FPU instructions to
copy (e.g. SSE on the P-III machines or 3DNow on the Athlons). Everything
was compiled with gcc 2.91; flags were -O3 -fomit-frame-pointer
-fstrict-aliasing -mcpu=pentiumpro -march=pentiumpro -static. The
machines weren't entirely idle, so I used second_cpu.c (the CPU-time timer).
Machine 1 is an Athlon 1333MHz (266MHz FSB) with an AMD 761 memory
controller and 512MB of 133MHz DDR SDRAM ("PC2100" memory). ECC is turned
*off* due to motherboard limitations. The memory is unbuffered. The
motherboard is an Asus A7M266.
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 10000000, Offset = 0
Total memory required = 228.9 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 9999 microseconds.
Each test below will take on the order of 249999 microseconds.
(= 25 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:          941.1765       0.1852       0.1700       0.2000
Scale:         592.5926       0.2851       0.2700       0.3000
Add:           727.2727       0.3462       0.3300       0.3600
Triad:         685.7143       0.3500       0.3500       0.3500
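As a cross-check on the table above, the Rate column can be reproduced from the Min time column. This is only a sketch, assuming the usual STREAM byte counting (Copy and Scale touch two arrays per iteration, Add and Triad touch three) and 1 MB = 10^6 bytes; the names `ARRAYS_TOUCHED` and `stream_rate_mb_s` are made up for illustration and are not part of STREAM.

```python
WORD_BYTES = 8  # "This system uses 8 bytes per DOUBLE PRECISION word."

# Assumed number of arrays read or written per element by each kernel.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def stream_rate_mb_s(kernel, n_elements, min_time_s):
    """Return the rate in MB/s (1 MB = 10**6 bytes) for the best time."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * WORD_BYTES * n_elements
    return bytes_moved / min_time_s / 1e6


# Min times reported for Machine 1, array size 10000000.
machine1_min_times = {"Copy": 0.17, "Scale": 0.27, "Add": 0.33, "Triad": 0.35}

for kernel, t in machine1_min_times.items():
    rate = stream_rate_mb_s(kernel, 10_000_000, t)
    print(f"{kernel + ':':<8}{rate:10.4f}")
```

Under these assumptions the computed rates agree with the reported ones to four decimal places, which also shows that every rate here is determined by a time that is only known to the nearest 0.01 s.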
Machine 2 is very similar to Machine 1, except that its DDR SDRAM is
registered (Machine 1's memory is unbuffered) and ECC is turned on.
The motherboard is an ABIT KG7; again, the memory controller is AMD 761.
This board doesn't flip out when I turn ECC on, so I left it on.
Interestingly, the results are almost the same, and consistently *better*
on the Triad kernel (727 vs. 686 MB/s). I'd expect a performance drop
from having ECC turned on, as well as from using registered memory, but
I guess the principal impact of each is on latency, not throughput.
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 10000000, Offset = 0
Total memory required = 228.9 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 9999 microseconds.
Each test below will take on the order of 259999 microseconds.
(= 26 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:          941.1765       0.2012       0.1700       0.2800
Scale:         592.5926       0.2861       0.2700       0.3000
Add:           727.2727       0.3492       0.3300       0.3700
Triad:         727.2727       0.3511       0.3300       0.3600
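The Triad gap between Machines 1 and 2 (min times of 0.35 s and 0.33 s) is two ticks of the 10 ms timer, so it may help to see how much one tick is worth. The sketch below assumes, purely for illustration, that each best time could be off by one tick in either direction; `rate_bounds_mb_s` is a hypothetical helper, not something STREAM provides.

```python
TICK_S = 0.01  # reported granularity of 9999 microseconds, rounded to 10 ms
TRIAD_BYTES = 3 * 8 * 10_000_000  # assumed: three arrays of 8-byte doubles


def rate_bounds_mb_s(min_time_s, tick_s=TICK_S, bytes_moved=TRIAD_BYTES):
    """Return (low, high) MB/s if the true time lies within one tick."""
    low = bytes_moved / (min_time_s + tick_s) / 1e6
    high = bytes_moved / (min_time_s - tick_s) / 1e6
    return low, high


for name, t in (("Machine 1", 0.35), ("Machine 2", 0.33)):
    low, high = rate_bounds_mb_s(t)
    print(f"{name}: Triad between {low:.1f} and {high:.1f} MB/s")
```

With this error model the two ranges only just meet, so a single run cannot separate the machines by much more than the timer resolution. That does not contradict the difference being consistent across runs; larger arrays would simply make it easier to measure.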
Machine 3 is a dual-processor 1GHz Pentium III on a SuperMicro 370DLE
motherboard; the memory controller is integrated into the ServerWorks III
chipset. The memory is 133MHz SDRAM. ServerWorks III interleaves memory
when possible (though only 2-way, I believe), and this machine has memory
in all banks. I think we're seeing pin bandwidth limitations of the CPU
here; interleaved SDR SDRAM *ought* to have about the same throughput as
DDR SDRAM at the same clock, right?
I ran only one copy of STREAM at a time on this machine. I hope that was
the correct thing to do.
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 10000000, Offset = 0
Total memory required = 228.9 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 9999 microseconds.
Each test below will take on the order of 330000 microseconds.
(= 33 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:          390.2439       0.4100       0.4100       0.4100
Scale:         400.0000       0.4070       0.4000       0.4100
Add:           510.6383       0.4770       0.4700       0.4800
Triad:         358.2090       0.6750       0.6700       0.6800
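The pin-bandwidth guess for Machine 3 can be put in rough numbers. This is a back-of-the-envelope sketch, not a measurement: it assumes 64-bit (8-byte) memory and CPU buses, a 133 MHz front-side bus on the Pentium III, and that 2-way interleaving doubles the memory-side peak. `peak_mb_s` is a name invented for this example.

```python
def peak_mb_s(clock_mhz, bus_bytes=8, transfers_per_clock=1, ways=1):
    """Theoretical peak in MB/s: clock * transfers/clock * width * ways."""
    return clock_mhz * transfers_per_clock * bus_bytes * ways


# Nominal 133 MHz is used throughout (the real clock is closer to 133.33).
ddr_pc2100 = peak_mb_s(133, transfers_per_clock=2)  # Machines 1 and 2 memory
sdr_2way = peak_mb_s(133, ways=2)  # Machine 3 memory, if 2-way interleaved
p3_bus = peak_mb_s(133)  # ASSUMPTION: 133 MHz, 64-bit Pentium III bus

print(f"DDR PC2100 peak:         {ddr_pc2100} MB/s")
print(f"2-way SDR peak:          {sdr_2way} MB/s")
print(f"Assumed P-III bus peak:  {p3_bus} MB/s")

best_m3 = 510.6383  # Add, the best rate reported for Machine 3
print(f"Machine 3 best rate / assumed bus peak: {best_m3 / p3_bus:.0%}")
```

Under these assumptions the memory-side peaks of the DDR and interleaved SDR systems are equal, and the only link that is half as wide in Machine 3 is the CPU bus, which is at least consistent with the measured rates being roughly half of the Athlon numbers.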
This archive was generated by hypermail 2b29 : Wed Oct 31 2001 - 11:26:46 CST