Stream results for HP Integrity Superdome with 1500MHz,
6MB L3 cache Itanium2 processors:
16 cells, 64 1500MHz cpus, 256GB of memory (512x512MB DIMMs):

Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:          82276.4205     0.0499     0.0498     0.0500
Scale:         81269.1622     0.0508     0.0504     0.0527
Add:           83036.8965     0.0748     0.0740     0.0767
Triad:         84048.9669     0.0733     0.0731     0.0735

16 cells, 32 1500MHz cpus, 256GB of memory (512x512MB DIMMs):

Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:          81855.4008     0.0502     0.0500     0.0503
Scale:         80862.9293     0.0509     0.0507     0.0512
Add:           82734.0532     0.0753     0.0743     0.0764
Triad:         82352.5381     0.0750     0.0746     0.0758

8 cells, 32 1500MHz cpus, 128GB of memory (256x512MB DIMMs):

Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:          41381.6738     0.0991     0.0990     0.0992
Scale:         40900.3209     0.1005     0.1001     0.1012
Add:           41559.1577     0.1485     0.1478     0.1491
Triad:         42134.5581     0.1461     0.1458     0.1466

8 cells, 16 1500MHz cpus, 128GB of memory (256x512MB DIMMs):

Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:          40971.0368     0.1017     0.1000     0.1078
Scale:         40471.5630     0.1032     0.1012     0.1092
Add:           41149.4431     0.1530     0.1493     0.1626
Triad:         41142.8734     0.1509     0.1493     0.1560

4 cells, 16 1500MHz cpus, 64GB of memory (128x512MB DIMMs):

Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:          20733.0610     0.1984     0.1976     0.2004
Scale:         20480.8295     0.2007     0.2000     0.2021
Add:           20786.1156     0.2966     0.2956     0.2990
Triad:         21052.8563     0.2928     0.2918     0.2951
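
As a rough cross-check of the rates above, here is a minimal sketch in C. It assumes the usual STREAM accounting: 8-byte words, Copy and Scale moving 2 words per element, Add and Triad moving 3, rates in units of 10^6 bytes/s taken from the minimum time, and the memory footprint reported in units of 2^20 bytes. The function names are hypothetical and are not part of stream_d.f. The array length used in these runs is n=256002800.

```c
/* Hypothetical helpers for sanity-checking STREAM output by hand.
 * Assumption: rate = bytes moved / time, with 1 MB = 1e6 bytes.
 * bytes_per_elem is 16.0 for Copy and Scale (2 arrays x 8 bytes)
 * and 24.0 for Add and Triad (3 arrays x 8 bytes). */
double stream_rate_mbs(double n, double bytes_per_elem, double seconds)
{
    return n * bytes_per_elem / seconds / 1.0e6;
}

/* Assumption: footprint = three arrays of n 8-byte words,
 * expressed with 1 MB = 2^20 bytes. */
double stream_footprint_mb(double n)
{
    return 3.0 * 8.0 * n / (1024.0 * 1024.0);
}
```

For example, stream_rate_mbs(256002800.0, 16.0, 0.0498) gives roughly 82250 MB/s, which agrees with the reported 64-cpu Copy rate to within the rounding of the printed minimum time, and stream_footprint_mb(256002800.0) truncates to 5859 MB.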

The half-populated configurations (2 cpus per cell) are fully
supported and orderable from HP.

The system was booted with half of the memory in each cell
configured as local memory. The system was running HP-UX
11i v2 (11.23) TCOE. A patch (PHKL_30089) was installed, which
improves the performance of 64-way pthreaded applications by
approximately 10%.

The f90 version of the stream benchmark was compiled auto-parallel, with
the following changes (mysecond.c is a C routine that calls gettimeofday):
diff ../src/stream.ORIG/stream_d.f stream_d.f
63c63
< PARAMETER (n=2000000,offset=0,ndim=n+offset,ntimes=10)
---
> PARAMETER (n=256002800,offset=0,ndim=n+offset,ntimes=10)
72c72
< INTEGER bytes(4)
---
> INTEGER*8 bytes(4)
90c90
< * COMMON a,b,c
---
> COMMON a,b,c

compiled as follows:

cc +DSitanium2 +DD64 +O3 +Odataprefetch -Wl,+pd,64M -c mysecond.c
f90 -o stream_d.mp +Ofaster +DSitanium2 -Wl,+pd,1M +DD64 +Oautopar +Onoopenmp +autodbl4 +extend_source +noppu stream_d.f mysecond.o

By default memory was allocated from local memory on a first-touch
basis (setting the page size hint to 1MB via the +pd 1M linker
option produces a better match between the granularity of
allocation and the chunk size each thread works on).

Here are the outputs for each configuration:

16 cells, 64 processors, 256 GB of memory (512x512MB DIMMs):

----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
Array size = 256002800
Offset = 0
The total memory requirement is 5859 MB
You are running each test 10 times
--
The *best* time for each test is used
*EXCLUDING* the first and last iterations
----------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds
----------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:          82276.4205     0.0499     0.0498     0.0500
Scale:         81269.1622     0.0508     0.0504     0.0527
Add:           83036.8965     0.0748     0.0740     0.0767
Triad:         84048.9669     0.0733     0.0731     0.0735
----------------------------------------------------
Solution Validates!
----------------------------------------------------

16 cells, 32 processors, 256 GB of memory (512x512MB DIMMs):

----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
Array size = 256002800
Offset = 0
The total memory requirement is 5859 MB
You are running each test 10 times
--
The *best* time for each test is used
*EXCLUDING* the first and last iterations
----------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds
----------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:          81855.4008     0.0502     0.0500     0.0503
Scale:         80862.9293     0.0509     0.0507     0.0512
Add:           82734.0532     0.0753     0.0743     0.0764
Triad:         82352.5381     0.0750     0.0746     0.0758
----------------------------------------------------
Solution Validates!
----------------------------------------------------

8 cells, 32 processors, 128 GB of memory (256x512MB DIMMs):

----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
Array size = 256002800
Offset = 0
The total memory requirement is 5859 MB
You are running each test 10 times
--
The *best* time for each test is used
*EXCLUDING* the first and last iterations
----------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds
----------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:          41381.6738     0.0991     0.0990     0.0992
Scale:         40900.3209     0.1005     0.1001     0.1012
Add:           41559.1577     0.1485     0.1478     0.1491
Triad:         42134.5581     0.1461     0.1458     0.1466
----------------------------------------------------
Solution Validates!
----------------------------------------------------

8 cells, 16 processors, 128 GB of memory (256x512MB DIMMs):

----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
Array size = 256002800
Offset = 0
The total memory requirement is 5859 MB
You are running each test 10 times
--
The *best* time for each test is used
*EXCLUDING* the first and last iterations
----------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds
----------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:          40971.0368     0.1017     0.1000     0.1078
Scale:         40471.5630     0.1032     0.1012     0.1092
Add:           41149.4431     0.1530     0.1493     0.1626
Triad:         41142.8734     0.1509     0.1493     0.1560
----------------------------------------------------
Solution Validates!
----------------------------------------------------

4 cells, 16 processors, 64 GB of memory (128x512MB DIMMs):

----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
Array size = 256002800
Offset = 0
The total memory requirement is 5859 MB
You are running each test 10 times
--
The *best* time for each test is used
*EXCLUDING* the first and last iterations
----------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds
----------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:          20733.0610     0.1984     0.1976     0.2004
Scale:         20480.8295     0.2007     0.2000     0.2021
Add:           20786.1156     0.2966     0.2956     0.2990
Triad:         21052.8563     0.2928     0.2918     0.2951
----------------------------------------------------
Solution Validates!
----------------------------------------------------

Received on Tue Mar 30 14:07:33 2004