From: Nahil Sobh (sobh@ncsa.uiuc.edu)
Date: Wed Dec 22 2004 - 10:41:07 CST
Dear John,
I have included our latest STREAM benchmark data for
NCSA's SGI-Altix Machine to update your STREAM ranking
list.
The machine is partitioned into two single-image systems
with 512p per system.
Peak Performance per partition: over 3 Tflops.
Processor: Intel Itanium 2 processors. 1.6 GHz 9 MB cache.
OS: Linux
One partition has 1 terabyte of globally accessible memory, while
the other partition has 2 terabytes of globally accessible memory.
Over 370 terabytes of SGI InfiniteStorage.
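As a rough cross-check of the "over 3 Tflops" figure, the theoretical peak can be estimated from the processor count and clock rate given above. This is only a sketch: the value of 4 floating-point operations per cycle per processor (two fused multiply-adds) is an assumption that is not stated in this message, and the function name is made up for illustration.

```python
# Assumption (not stated in the message): each processor can complete
# two fused multiply-adds per cycle, i.e. 4 floating-point operations.
FLOPS_PER_CYCLE = 4


def peak_tflops(processors: int, clock_ghz: float,
                flops_per_cycle: int = FLOPS_PER_CYCLE) -> float:
    """Theoretical peak of one partition in Tflop/s."""
    # processors * GHz * flops/cycle gives Gflop/s; divide by 1000 for Tflop/s.
    return processors * clock_ghz * flops_per_cycle / 1000.0


partition_peak = peak_tflops(processors=512, clock_ghz=1.6)
print(f"Estimated peak per 512p partition: {partition_peak:.2f} Tflop/s")
```

Under that assumption the estimate lands a little above 3 Tflop/s, which is consistent with the figure quoted for each partition.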
These runs were done by John Baron and Kevin McMahon from SGI.
The code was compiled with -O3 -i8 -openmp -mP2OPT_hlo_prefetch=F.
The last flag disables some unnecessary prefetches that the
compiler was generating.
Build 20040901 of the Intel 7.1 compiler was used.
====================================================
SGI Altix 3700 Bx2
512p 1.6GHz 9M
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
Array size = 3290112000
Offset = 4351
The total memory requirement is 75304 MB
You are running each test 20 times
--
The *best* time for each test is used
*EXCLUDING* the first and last iterations
----------------------------------------------------
Your clock granularity appears to be less than one microsecond
Your clock granularity/precision appears to be 1 microseconds
----------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:         906388.1661     0.0582     0.0581     0.0586
Scale:        870211.2240     0.0608     0.0605     0.0613
Add:         1055179.1819     0.0750     0.0748     0.0753
Triad:       1119912.9516     0.0711     0.0705     0.0713
----------------------------------------------------
Solution Validates!
----------------------------------------------------
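The figures in the output above can be roughly reproduced from the array size and the minimum times. The sketch below assumes the usual STREAM counting convention (Copy and Scale touch two arrays per element, Add and Triad touch three), that rates are reported in units of 10^6 bytes per second, and that the memory requirement is three arrays of N words expressed in units of 2^20 bytes. These conventions are assumptions here, not something this message states. Because the printed times are rounded to four decimal places, only approximate agreement should be expected.

```python
N = 3_290_112_000      # "Array size" from the output above
BYTES_PER_WORD = 8     # "Assuming 8 bytes per DOUBLE PRECISION word"

# Assumed counting convention: arrays read plus arrays written per element.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# "Min time" and "Rate (MB/s)" columns copied from the output above.
MIN_TIME = {"Copy": 0.0581, "Scale": 0.0605, "Add": 0.0748, "Triad": 0.0705}
REPORTED = {"Copy": 906388.1661, "Scale": 870211.2240,
            "Add": 1055179.1819, "Triad": 1119912.9516}


def total_memory_mb(n: int = N) -> int:
    """Three arrays of n words, in units of 2**20 bytes (assumed)."""
    return 3 * BYTES_PER_WORD * n // 2**20


def rate_mb_s(kernel: str) -> float:
    """Bytes moved divided by the best time, in 10**6 bytes per second."""
    return ARRAYS_TOUCHED[kernel] * BYTES_PER_WORD * N / MIN_TIME[kernel] / 1e6


print("memory (MB):", total_memory_mb())
for kernel in REPORTED:
    estimate = rate_mb_s(kernel)
    rel_diff = abs(estimate - REPORTED[kernel]) / REPORTED[kernel]
    print(f"{kernel:6s} estimated {estimate:12.1f}  "
          f"reported {REPORTED[kernel]:12.1f}  diff {rel_diff:.3%}")
```

The small differences come from the rounding of the printed minimum times; the rates themselves were presumably computed from the unrounded values.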
This was run using OMP_NUM_THREADS=510 and using dplace to bind the threads
to cpus 2-511.
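For a sense of scale, the aggregate rates can be divided by the thread count to give an average per-thread bandwidth. This is just arithmetic on the numbers reported above; the variable names are illustrative, and the even split across threads is an averaging assumption rather than a measurement.

```python
THREADS = 510               # OMP_NUM_THREADS from the run described above
CPUS = range(2, 512)        # cpus 2-511 inclusive

# "Rate (MB/s)" column copied from the benchmark output.
REPORTED = {"Copy": 906388.1661, "Scale": 870211.2240,
            "Add": 1055179.1819, "Triad": 1119912.9516}

# The CPU range should provide exactly one CPU per thread.
cpus_match_threads = len(CPUS) == THREADS

# Average share of the aggregate bandwidth per thread, in MB/s.
per_thread = {kernel: rate / THREADS for kernel, rate in REPORTED.items()}

print("one CPU per thread:", cpus_match_threads)
for kernel, rate in per_thread.items():
    print(f"{kernel:6s} {rate:8.1f} MB/s per thread")
```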
The "tuned" version of the STREAM code was used; aside from some minor
formatting changes to handle all the digits for the large rates, the kernels
were modified using the following compiler prefetch intrinsics:
      subroutine stream_copy (c, a, n)
      real*8 c(*), a(*)
!$OMP PARALLEL DO
      do j = 1,n
         if (iand(j,15_8) .eq. 0) then
            call lfetch_excl_nta(c(j+190))
            call lfetch_nta(a(j+190))
         end if
         c(j) = a(j)
      end do
      end
      subroutine stream_scale (b, c, scalar, n)
      real*8 b(*), c(*), scalar
!$OMP PARALLEL DO
      do j = 1,n
         if (iand(j,15_8) .eq. 0) then
            call lfetch_excl_nta(b(j+190))
            call lfetch_nta(c(j+190))
         end if
         b(j) = scalar*c(j)
      end do
      end
      subroutine stream_add (c, a, b, n)
      real*8 c(*), a(*), b(*)
!$OMP PARALLEL DO
      do j = 1,n
         if (iand(j,15_8) .eq. 0) then
            call lfetch_nta(b(j+190))
            call lfetch_nta(a(j+190))
            call lfetch_excl_nta(c(j+190))
         end if
         c(j) = a(j) + b(j)
      end do
      end
      subroutine stream_triad (a, b, c, scalar, n)
      real*8 a(*), b(*), c(*), scalar
!$OMP PARALLEL DO
      do j = 1,n
         if (iand(j,15_8) .eq. 0) then
            call lfetch_nta(b(j+190))
            call lfetch_nta(c(j+190))
            call lfetch_excl_nta(a(j+190))
         end if
         a(j) = b(j) + scalar*c(j)
      end do
      end
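To make the prefetch pattern in these kernels easier to see, here is a small sketch that lists which loop iterations issue a prefetch and which element each one targets. It models only the index arithmetic (the `iand(j,15_8)` test and the `j+190` offset), not the prefetch instructions themselves. The 128-byte cache-line size is an assumption that is not stated in this message, and the function name is made up for illustration.

```python
WORD_BYTES = 8         # real*8
PREFETCH_MASK = 15     # iand(j, 15) == 0 is true on every 16th iteration
PREFETCH_AHEAD = 190   # elements ahead of the current index j
LINE_BYTES = 128       # assumption: cache-line size, not given in the message


def prefetch_targets(n: int) -> list[tuple[int, int]]:
    """Return (j, prefetched 1-based index) for iterations that prefetch."""
    return [(j, j + PREFETCH_AHEAD)
            for j in range(1, n + 1)
            if (j & PREFETCH_MASK) == 0]


targets = prefetch_targets(64)

# Byte distance between consecutive prefetch targets in one array.
stride_bytes = (targets[1][1] - targets[0][1]) * WORD_BYTES

# How far ahead of the current element each prefetch reaches.
distance_bytes = PREFETCH_AHEAD * WORD_BYTES
distance_lines = distance_bytes / LINE_BYTES

print(targets)
print("stride between prefetches:", stride_bytes, "bytes")
print("prefetch distance:", distance_bytes, "bytes,",
      distance_lines, "assumed cache lines")
```

Under the 128-byte assumption, each array gets one prefetch per cache line, issued a little under twelve lines ahead of the element being processed.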
Nahil A. Sobh, Ph.D.
Senior Research Scientist and Group Leader
National Center for Supercomputing Applications (NCSA)
152 Computing Applications Building
605 East Springfield Avenue
Champaign, IL 61820
(217) 244 9481(office) 244 6400(Sec.) 244 6829(Fax)