From: John D Mccalpin (mccalpin@us.ibm.com)
Date: Thu Apr 11 2002 - 20:13:20 CDT
Hi John,
These STREAM results have been approved for publication.
The 16p results were obtained on an IBM eServer pSeries 690 HPC
system and the 32p results on a pSeries 690 Turbo system, both
running the version of AIX internally called AIX 5.1D, which
starts shipping in April 2002. Each system was configured with
128 GB RAM, consisting of eight 16 GB memory features.
These are "standard" results, with no code modifications.
Each array was sized to be at least 4x the 512 MB of L3 cache
in each system, so N=256,000,000 (2,048,000,000 bytes per array
at 8 bytes per element).
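The sizing arithmetic can be checked with a short sketch. The function names are illustrative only, and the 4x figure assumes that "512 MB" is read as 512 x 10^6 bytes; read as 512 x 2^20 bytes the ratio comes out nearer 3.8. The "total memory requirement" printed by STREAM matches when MB means 2^20 bytes.

```python
# Sketch of the array-sizing arithmetic; names are illustrative, not from STREAM.
BYTES_PER_WORD = 8        # STREAM banner: 8 bytes per DOUBLE PRECISION word
N = 256_000_000           # array length quoted in the message
NUM_ARRAYS = 3            # STREAM operates on three arrays

def array_bytes(n=N, word=BYTES_PER_WORD):
    """Size of one array in bytes."""
    return n * word

def total_mib(n=N, word=BYTES_PER_WORD, arrays=NUM_ARRAYS):
    """Total footprint of all arrays in units of 2**20 bytes."""
    return arrays * n * word / 2**20

def cache_multiple(l3_bytes, n=N, word=BYTES_PER_WORD):
    """How many times larger one array is than the L3 cache."""
    return n * word / l3_bytes

if __name__ == "__main__":
    print(array_bytes())                      # bytes per array
    print(int(total_mib()))                   # compare with the STREAM banner
    print(cache_multiple(512 * 10**6))        # assumption: decimal megabytes
    print(cache_multiple(512 * 2**20))        # assumption: binary megabytes
```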
These tests were run with memory affinity enabled, but did not
use large pages. We feel that these small-page results better
represent the most common customer application environment.
Both large pages and code modifications can improve performance
in many (but not all) situations. The most effective code
modification involves the insertion of DCBZ instructions
to allocate the target array in the cache without reading it
from memory. An example of how to get the XLF 7 compiler to
do this is already on the STREAM web site, in the directory:
ftp://ftp.cs.virginia.edu/pub/stream/Code/Contrib/POWER4/
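Why DCBZ helps can be illustrated with a rough traffic model. This is a sketch under a stated assumption, not a measurement: it assumes that without DCBZ every stored cache line of the target array is first read from memory, so each store costs a read plus a write, while STREAM counts only the write. The names below are hypothetical.

```python
# Rough memory-traffic model per loop iteration; an assumption-laden sketch.
WORD = 8  # bytes per DOUBLE PRECISION word

# (arrays loaded, arrays stored) per iteration for each STREAM kernel
KERNELS = {"Copy": (1, 1), "Scale": (1, 1), "Add": (2, 1), "Triad": (2, 1)}

def counted_bytes(kernel):
    """Bytes per iteration that STREAM counts when reporting MB/s."""
    loads, stores = KERNELS[kernel]
    return (loads + stores) * WORD

def modeled_bytes(kernel, target_read_first=True):
    """Bytes per iteration actually moved under the model.

    target_read_first=True  : the target line is read before being written
                              (assumed behaviour without DCBZ).
    target_read_first=False : the target line is allocated without a read
                              (the effect the message attributes to DCBZ).
    """
    loads, stores = KERNELS[kernel]
    store_cost = 2 if target_read_first else 1
    return (loads + stores * store_cost) * WORD

def hidden_traffic_ratio(kernel):
    """Modeled traffic without DCBZ divided by the traffic STREAM counts."""
    return modeled_bytes(kernel, True) / counted_bytes(kernel)

if __name__ == "__main__":
    for name in KERNELS:
        print(name, counted_bytes(name), modeled_bytes(name), hidden_traffic_ratio(name))
```

Under this model the reported Copy and Scale rates understate the real memory traffic by more than the Add and Triad rates do, which is one plausible reading of why those two kernels report lower numbers in the results.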
If there is sufficient customer demand, we will consider
publishing the "tuned" numbers using large pages and/or code
modifications. However, we believe that the numbers submitted
here establish the capability of the system quite clearly.
Summary of Improvements:
STANDARD 16p   Published     New   Delta
Copy               17394   20267     14%
Scale              17066   20265     16%
Add                19676   24706     20%
Triad              20051   25058     20%

STANDARD 32p   Published     New   Delta
Copy               22421   28611     22%
Scale              21411   28994     26%
Add                24830   32222     23%
Triad              25501   32249     21%
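The Delta column can be reproduced with the sketch below. Working backwards from the numbers, the percentages appear to be computed relative to the New value, (New - Published) / New, rather than relative to the Published value; that reading is an inference from the arithmetic, not something the message states. The function names are illustrative.

```python
# Reproduce the Delta column from the Published and New rates (MB/s).
RESULTS = {
    "16p": {"Copy": (17394, 20267), "Scale": (17066, 20265),
            "Add": (19676, 24706), "Triad": (20051, 25058)},
    "32p": {"Copy": (22421, 28611), "Scale": (21411, 28994),
            "Add": (24830, 32222), "Triad": (25501, 32249)},
}

def delta_vs_new(published, new):
    """Percent difference relative to the New value (matches the table)."""
    return round(100 * (new - published) / new)

def gain_vs_published(published, new):
    """Percent gain relative to the Published value (the other convention)."""
    return round(100 * (new - published) / published)

if __name__ == "__main__":
    for config, kernels in RESULTS.items():
        for name, (published, new) in kernels.items():
            print(config, name,
                  delta_vs_new(published, new),
                  gain_vs_published(published, new))
```

Measured against the Published baseline instead, the improvements come out somewhat larger than the Delta column shows.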
Detailed Results:
p690 HPC
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
Array size = 256000000
Offset = 256
The total memory requirement is 5859 MB
You are running each test 10 times
--
The *best* time for each test is used
*EXCLUDING* the first and last iterations
rebind: num_parthds is 16
----------------------------------------------------
Your clock granularity appears to be less than one microsecond
Your clock granularity/precision appears to be 1 microseconds
----------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:          20266.5212      .2023      .2021      .2024
Scale:         20264.7402      .2023      .2021      .2024
Add:           24705.5623      .2488      .2487      .2489
Triad:         25058.2375      .2454      .2452      .2456
----------------------------------------------------
Solution Validates!
----------------------------------------------------
p690 Turbo
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
Array size = 256000000
Offset = 512
The total memory requirement is 5859 MB
You are running each test 10 times
--
The *best* time for each test is used
*EXCLUDING* the first and last iterations
----------------------------------------------------
Your clock granularity appears to be less than one microsecond
Your clock granularity/precision appears to be 1 microseconds
----------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:          28610.8942      .1434      .1432      .1438
Scale:         28993.8226      .1415      .1413      .1416
Add:           32222.4249      .1909      .1907      .1911
Triad:         32248.5949      .1908      .1905      .1910
----------------------------------------------------
Solution Validates!
----------------------------------------------------
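The reported rates follow from the minimum times. STREAM counts two arrays of traffic for Copy and Scale and three for Add and Triad, at 8 bytes per element, and divides by the best time, with MB meaning 10^6 bytes. The sketch below recomputes the rates from the printed Min time values; because those are rounded to four decimal places, the results agree with the printed rates only to within a few parts in ten thousand. The names are illustrative.

```python
# Recompute STREAM rates from the printed minimum times.
N = 256_000_000   # array size from the output
WORD = 8          # bytes per DOUBLE PRECISION word

# Number of arrays STREAM counts as moved per kernel
ARRAYS_COUNTED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

def rate_mb_s(kernel, min_time_s):
    """Rate in MB/s (MB = 10**6 bytes) for one kernel and its best time."""
    return ARRAYS_COUNTED[kernel] * WORD * N / min_time_s / 1e6

# (printed min time in seconds, printed rate in MB/s)
HPC = {"Copy": (0.2021, 20266.5212), "Scale": (0.2021, 20264.7402),
       "Add": (0.2487, 24705.5623), "Triad": (0.2452, 25058.2375)}
TURBO = {"Copy": (0.1432, 28610.8942), "Scale": (0.1413, 28993.8226),
         "Add": (0.1907, 32222.4249), "Triad": (0.1905, 32248.5949)}

def relative_error(kernel, min_time_s, printed_rate):
    """Relative difference between the recomputed and printed rate."""
    return abs(rate_mb_s(kernel, min_time_s) - printed_rate) / printed_rate

if __name__ == "__main__":
    for label, table in (("p690 HPC", HPC), ("p690 Turbo", TURBO)):
        for name, (t, printed) in table.items():
            print(label, name, round(rate_mb_s(name, t), 1), printed)
```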
Sincerely,
Your evil twin brother....
---
John D. McCalpin, Ph.D.
STSM, eServer Hardware Performance
IBM - 11400 Burnet Road, MS 045-3N098
Austin, TX 78758
(512)838-6167 or tie line 678/6167
FAX (512)838-6486 or 678/6486