From: Duc Vianney (dvianney@us.ibm.com)
Date: Sun Sep 12 2004 - 15:42:49 CDT
Here is the story:
The system contains two POWER5 chips, each with
two "processor cores", for a total of four physical "processor
cores". Each of these POWER5 "processor cores" is capable of
simultaneously executing two "threads". When running in this
"Simultaneous Multi-Threaded" (SMT) mode, the system appears
to have twice as many "cpus". I call these entities "logical
processors" to avoid confusion. The results below used a
total of 8 OpenMP threads, one on each of the 8 "logical processors"
provided by the 4 "physical processors".
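A minimal sketch of the counting described above (2 chips x 2 cores x 2 SMT threads = 8 logical processors). The mapping from logical ID to chip, core and thread is an assumption for illustration, with SMT siblings given adjacent IDs; the note does not say how Linux actually numbers them on this machine.

```python
# Hypothetical sketch: enumerate logical processors for 2 chips x 2 cores x 2 SMT threads.
CHIPS = 2
CORES_PER_CHIP = 2
THREADS_PER_CORE = 2


def logical_processors(chips, cores_per_chip, threads_per_core):
    """Return (logical_id, chip, core, thread) tuples.

    Assumes SMT siblings receive adjacent logical IDs (thread varies fastest);
    the real OS numbering may differ.
    """
    result = []
    logical_id = 0
    for chip in range(chips):
        for core in range(cores_per_chip):
            for thread in range(threads_per_core):
                result.append((logical_id, chip, core, thread))
                logical_id += 1
    return result


cpus = logical_processors(CHIPS, CORES_PER_CHIP, THREADS_PER_CORE)
physical_cores = CHIPS * CORES_PER_CHIP
print(f"{physical_cores} physical cores, {len(cpus)} logical processors")
for lid, chip, core, thread in cpus:
    print(f"logical cpu {lid}: chip {chip}, core {core}, SMT thread {thread}")
```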
(I hope that is clear. I cannot think of any less confusing nomenclature. Since I am a member of the IBM POWER5 design team, it is sometimes hard for me to step back and understand how other people use these words....)
It is inevitable that as computers get more complex, it will become more and more difficult to invent a labelling scheme that is not confusing or misleading.
The good news is that in this particular case, the results depend only weakly on the number of threads used. The IBM eServer p5 550 contains nearly identical hardware and obtains its slightly better STREAM results using large (16 MB) pages with only 4 OpenMP threads. When using the default small (4 kB) page size in Linux or AIX, more threads are needed to generate enough concurrent outstanding cache misses to reach these asymptotic bandwidth results.
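The link between outstanding cache misses and bandwidth can be made concrete with Little's law: the number of cache lines that must be in flight equals bandwidth times latency divided by line size. The sketch below assumes a 128-byte cache line and a hypothetical 100 ns memory latency; neither figure comes from this note. It also shows how many cache lines fit in a 4 kB page versus a 16 MB page.

```python
# Little's law sketch: lines in flight = bandwidth * latency / line size.
# All inputs are illustrative assumptions, not measured values.


def outstanding_misses_needed(bandwidth_bytes_per_s, latency_s, line_bytes):
    """Cache lines that must be in flight to sustain the given bandwidth."""
    return bandwidth_bytes_per_s * latency_s / line_bytes


LINE_BYTES = 128      # assumed cache-line size
LATENCY_S = 100e-9    # hypothetical memory latency of 100 ns
TARGET_BW = 8.8e9     # roughly the tuned Triad rate, in bytes/s

needed = outstanding_misses_needed(TARGET_BW, LATENCY_S, LINE_BYTES)
print(f"~{needed:.1f} cache lines in flight")

# Cache lines contained in one page of each size.
for page_bytes, name in [(4 * 1024, "4 kB"), (16 * 1024 * 1024, "16 MB")]:
    print(f"{name} page holds {page_bytes // LINE_BYTES} cache lines")
```

Under these assumptions a single thread would need several misses outstanding at once, which is why spreading the work over more threads can help when each one is limited in how far ahead it can run.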
Comments and suggestions are welcome....
--------- end of STREAM Editor's Note 2004-09-13 ------------
Here are the Standard and Tuned STREAM results for the IBM eServer OpenPower 720
(1650 MHz, 4CPU, Linux).
IBM eServer OpenPower 720 (1650 MHz, 4CPU, Linux)
Standard STREAM submission
Number of Threads = 8
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
Array size = 66060288
Offset = 96
The total memory requirement is 1512 MB
You are running each test 100 times
The *best* time for each test is used
----------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds
The tests below will each take a time on the order
of 176185 microseconds
(= 176185 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
----------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
----------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Copy:           5874.6379      .1894      .1799      .6120
Scale:          5783.2571      .1831      .1828      .1929
Add:            7439.3970      .2135      .2131      .2137
Triad:          7531.6010      .2108      .2105      .2110
Sum of a is = 0.537150969562697781E+126
Sum of b is = 0.107430193912641460E+126
Sum of c is = 0.143240258546879354E+126
locking to cpu 0
locking to cpu 1
locking to cpu 2
locking to cpu 3
locking to cpu 4
locking to cpu 5
locking to cpu 6
locking to cpu 7
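The header and rate figures of the Standard run can be cross-checked by hand. STREAM counts 2 words moved per element for Copy and Scale and 3 for Add and Triad. The sketch below recomputes the 1512 MB memory requirement (which matches exactly when a megabyte is taken as 2**20 bytes) and the reported rates (which use 1 MB = 10**6 bytes), with the min times copied from the table. The small differences come from the min times being printed to only four decimal places.

```python
# Cross-check of the Standard STREAM run using values printed in the output.
N = 66060288        # "Array size"
WORD_BYTES = 8      # "8 bytes per DOUBLE PRECISION word"

# Three arrays (a, b, c) of N doubles each; 2**20-byte megabytes.
total_bytes = 3 * N * WORD_BYTES
total_mb = total_bytes / 2**20
print(f"memory requirement: {total_mb:.0f} MB")

# Words read plus words written per array element for each kernel.
WORDS_PER_ELEMENT = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# "Min time" (seconds) and "Rate (MB/s)" columns of the Standard submission.
min_time = {"Copy": 0.1799, "Scale": 0.1828, "Add": 0.2131, "Triad": 0.2105}
reported = {"Copy": 5874.6379, "Scale": 5783.2571, "Add": 7439.3970, "Triad": 7531.6010}

rates = {}
for kernel, words in WORDS_PER_ELEMENT.items():
    # STREAM rates use 1 MB = 10**6 bytes.
    rates[kernel] = words * WORD_BYTES * N / min_time[kernel] / 1e6
    print(f"{kernel:6s} computed {rates[kernel]:10.1f}  reported {reported[kernel]:10.1f} MB/s")
```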
IBM eServer OpenPower 720 (1650 MHz, 4CPU, Linux)
Tuned STREAM submission
Number of Threads = 8
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
Array size = 66060288
Offset = 96
The total memory requirement is 1512 MB
You are running each test 100 times
The *best* time for each test is used
----------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds
The tests below will each take a time on the order
of 176451 microseconds
(= 176451 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
----------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
----------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Copy:           6153.7267      .1831      .1718      .6521
Scale:          6014.3981      .1759      .1757      .1772
Add:            8610.8571      .1844      .1841      .1860
Triad:          8801.5882      .1804      .1801      .1807
Sum of a is = 0.537150969562697781E+126
Sum of b is = 0.107430193912641460E+126
Sum of c is = 0.143240258546879354E+126
locking to cpu 0
locking to cpu 1
locking to cpu 2
locking to cpu 3
locking to cpu 4
locking to cpu 5
locking to cpu 6
locking to cpu 7
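Comparing the two submissions side by side makes the effect of tuning easier to see. This short sketch simply divides the Tuned rates by the Standard rates reported above; it says nothing about which tuning changes were made, since the note does not describe them.

```python
# Relative improvement of the Tuned over the Standard submission (rates in MB/s).
standard = {"Copy": 5874.6379, "Scale": 5783.2571, "Add": 7439.3970, "Triad": 7531.6010}
tuned = {"Copy": 6153.7267, "Scale": 6014.3981, "Add": 8610.8571, "Triad": 8801.5882}

improvement = {k: 100.0 * (tuned[k] / standard[k] - 1.0) for k in standard}
for kernel, pct in improvement.items():
    print(f"{kernel:6s} {pct:5.1f}% faster when tuned")
```

Add and Triad gain roughly 16 to 17 percent, while Copy and Scale gain only about 4 to 5 percent.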
Memory Info:
MemTotal: 32163332 kB
MemFree: 31981016 kB
Buffers: 24416 kB
Cached: 51656 kB
SwapCached: 0 kB
Active: 50408 kB
Inactive: 37492 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 32163332 kB
LowFree: 31981016 kB
SwapTotal: 1048568 kB
SwapFree: 1048568 kB
Dirty: 144 kB
Writeback: 0 kB
Mapped: 15144 kB
Slab: 32596 kB
Committed_AS: 91820 kB
PageTables: 632 kB
VmallocTotal: 2147483647 kB
VmallocUsed: 4040 kB
VmallocChunk: 2147479015 kB
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 16384 kB
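The memory listing can be checked for large-page availability. The hypothetical helper below (not part of STREAM) parses /proc/meminfo-style text; with HugePages_Total at 0, no 16 MB huge pages were reserved, so the arrays presumably used the default small pages.

```python
# Hypothetical helper: parse /proc/meminfo-style text into a dict of integers.


def parse_meminfo(text):
    """Map each field name to its integer value (the 'kB' unit is dropped)."""
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        info[key.strip()] = int(value.split()[0])
    return info


# Excerpt of the listing above.
sample = """\
MemTotal: 32163332 kB
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 16384 kB
"""

info = parse_meminfo(sample)
print(f"huge page size: {info['Hugepagesize'] // 1024} MB")
print(f"huge pages reserved: {info['HugePages_Total']}")
if info["HugePages_Total"] == 0:
    print("no huge pages reserved; default small pages presumably used")
```

On a live Linux system the same function could be applied to `open("/proc/meminfo").read()`.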
CPU Info:
processor : 0
cpu : POWER5 (gr)
clock : 1656.000000MHz
revision : 2.1
processor : 1
cpu : POWER5 (gr)
clock : 1656.000000MHz
revision : 2.1
processor : 2
cpu : POWER5 (gr)
clock : 1656.000000MHz
revision : 2.1
processor : 3
cpu : POWER5 (gr)
clock : 1656.000000MHz
revision : 2.1
processor : 4
cpu : POWER5 (gr)
clock : 1656.000000MHz
revision : 2.1
processor : 5
cpu : POWER5 (gr)
clock : 1656.000000MHz
revision : 2.1
processor : 6
cpu : POWER5 (gr)
clock : 1656.000000MHz
revision : 2.1
processor : 7
cpu : POWER5 (gr)
clock : 1656.000000MHz
revision : 2.1
timebase : 207000000
machine : CHRP IBM,9124-720
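The CPU listing can be summarized the same way. This hypothetical helper counts the `processor` entries and collects the `clock` values from /proc/cpuinfo-style text; the sample rebuilds the eight entries shown above. Eight logical processors is consistent with 4 physical cores running 2 SMT threads each.

```python
# Hypothetical helper: summarize /proc/cpuinfo-style text (requires Python 3.9+).


def summarize_cpuinfo(text):
    """Return (number of logical processors, set of reported clocks in MHz)."""
    count = 0
    clocks = set()
    for line in text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "processor":
            count += 1
        elif key == "clock":
            clocks.add(float(value.strip().removesuffix("MHz")))
    return count, clocks


# Rebuild the listing above: eight identical POWER5 entries plus the trailer.
sample = "".join(
    f"processor : {i}\ncpu : POWER5 (gr)\nclock : 1656.000000MHz\nrevision : 2.1\n"
    for i in range(8)
) + "timebase : 207000000\nmachine : CHRP IBM,9124-720\n"

count, clocks = summarize_cpuinfo(sample)
print(f"{count} logical processors, clock(s): {sorted(clocks)} MHz")
```

Note that the kernel reports 1656 MHz, slightly above the nominal 1650 MHz given in the submission title.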
Operating System Info:
SUSE LINUX Enterprise Server 9 for IBM POWER
Linux version 2.6.5-7.97-pseries64 (geeko@buildhost) (gcc version 3.3.3 (SuSE Linux)) #1 SMP Fri Jul 2 14:21:59 UTC 2004
Compiler Info:
XL Fortran Enterprise Edition Version 9.1 for Linux
Regards .. Duc.
Duc J Vianney, Ph. D., IBM Linux Technology Center Performance Team
dvianney@us.ibm.com, Phone: (512) 838-9919 Fax: (512) 838-0070
home page: http://www-124.ibm.com/developerworks/opensource/linuxperf/
project page: http://www-124.ibm.com/developerworks/projects/linuxperf
This archive was generated by hypermail 2.1.4 : Mon Sep 13 2004 - 08:37:09 CDT