Tuned STREAM on IBM eServer p5 550 Express (1500 MHz, 4 CPUs)

From: Frank Johnston (fjohn@us.ibm.com)
Date: Mon Oct 04 2004 - 16:20:55 CDT


    These are tuned STREAM results on an IBM eServer p5 550 Express
    with four 1500 MHz CPUs (36 MB L3 cache). This is a POWER5 SMP machine.
    Large pages were used in all cases.

    Function    Rate (MB/s)   RMS time (s)   Min time (s)   Max time (s)
    Copy:         6336.64         .17            .17            .19
    Scale:        6203.58         .17            .17            .17
    Add:          9044.43         .18            .18            .18
    Triad:        9260.01         .17            .17            .17
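    The Rate column follows from the array size and the Min time. The sketch
    below assumes the reference STREAM counting convention: Copy and Scale
    count two 8-byte words per element, Add and Triad count three, and
    1 MB = 10^6 bytes. It uses the array size of the run whose rates match
    this summary exactly (Array size = 67077120 in the full output). The
    helper names (ARRAYS_TOUCHED, implied_min_time) are hypothetical.

```python
# Hypothetical helper: recompute the time implied by a reported STREAM rate.
# ASSUMPTION: reference STREAM byte counting (2 arrays for Copy/Scale,
# 3 arrays for Add/Triad), 8-byte words, and 1 MB = 1e6 bytes.
BYTES_PER_WORD = 8
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def implied_min_time(kernel: str, n_words: int, rate_mb_s: float) -> float:
    """Seconds implied by a reported rate for one pass of a kernel."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_WORD * n_words
    return bytes_moved / (rate_mb_s * 1e6)


# Array size and (rate, Min time) pairs taken from the summary table.
N = 67077120
summary = {
    "Copy": (6336.64, 0.17),
    "Scale": (6203.58, 0.17),
    "Add": (9044.43, 0.18),
    "Triad": (9260.01, 0.17),
}

for kernel, (rate, reported_min) in summary.items():
    t = implied_min_time(kernel, N, rate)
    print(f"{kernel:6s} implied {t:.4f} s, reported Min time {reported_min:.2f} s")
```

    Each implied time rounds to the two-decimal Min time shown in the table,
    which is why Add and Triad post higher MB/s than Copy and Scale at nearly
    the same elapsed time: they are credited with 50% more bytes per pass.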

    Here is the full output file:
    ------------------------------------------------------------
     Requesting Large Pages
     Setting up for 2 CPUs per module
     Number of segments per array = 2
     CPU binding list : 0 2
     Shared Segment Pointer = 504403158265495552
     Shared Segment Pointer = 504403158802366464
     Shared Segment Pointer = 504403159339237376
     Segment Size (B) = 268435456 (MB = 256 )
     Array Size (B) = 536870912 (MB = 512 )
     Array Size (DW) = 67108864
     Num_threads = 4
     Num_threads = 4
     Num_threads = 4
     Num_threads = 4
     rebind: num_parthds is 4
     Starting Initialization
     Done With Initialization
     a(1) 1.00000000000000000
     b(M) 1.00000000000000000
     c(M) 1.00000000000000000
     Incremental Offset = 512
     Number of Threads = 4
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 67079168
     Offset = 0
     The total memory requirement is 1535 MB
     You are running each test 5 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 173001 microseconds
        (= 173001 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function Rate (MB/s) RMS time Min time Max time
    Copy: 6332.81 .17 .17 .19
    Scale: 6202.76 .17 .17 .17
    Add: 9044.40 .18 .18 .18
    Triad: 9241.50 .17 .17 .17
     Sum of a is = 101876486400000.000
     Sum of b is = 20375297280000.0000
     Sum of c is = 27167063040000.0000
     Incremental Offset = 1536
     Number of Threads = 4
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 67079168
     Offset = 0
     The total memory requirement is 1535 MB
     You are running each test 5 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 173075 microseconds
        (= 173075 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function Rate (MB/s) RMS time Min time Max time
    Copy: 6334.93 .17 .17 .19
    Scale: 6202.84 .17 .17 .17
    Add: 9041.60 .18 .18 .18
    Triad: 9244.06 .17 .17 .17
     Sum of a is = 101876486400000.000
     Sum of b is = 20375297280000.0000
     Sum of c is = 27167063040000.0000
     Incremental Offset = 2560
     Number of Threads = 4
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 67079168
     Offset = 0
     The total memory requirement is 1535 MB
     You are running each test 5 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 173063 microseconds
        (= 173063 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function Rate (MB/s) RMS time Min time Max time
    Copy: 6335.98 .17 .17 .19
    Scale: 6202.82 .17 .17 .17
    Add: 9037.42 .18 .18 .18
    Triad: 9237.77 .17 .17 .17
     Sum of a is = 101876486400000.000
     Sum of b is = 20375297280000.0000
     Sum of c is = 27167063040000.0000
     Incremental Offset = 512
     Number of Threads = 4
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 67077120
     Offset = 0
     The total memory requirement is 1535 MB
     You are running each test 5 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 173122 microseconds
        (= 173122 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function Rate (MB/s) RMS time Min time Max time
    Copy: 6336.64 .17 .17 .19
    Scale: 6203.58 .17 .17 .17
    Add: 9044.43 .18 .18 .18
    Triad: 9260.01 .17 .17 .17
     Sum of a is = 101873376000000.000
     Sum of b is = 20374675200000.0000
     Sum of c is = 27166233600000.0000
     Incremental Offset = 1536
     Number of Threads = 4
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 67077120
     Offset = 0
     The total memory requirement is 1535 MB
     You are running each test 5 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 173064 microseconds
        (= 173064 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function Rate (MB/s) RMS time Min time Max time
    Copy: 6333.57 .17 .17 .19
    Scale: 6203.03 .17 .17 .17
    Add: 9042.34 .18 .18 .18
    Triad: 9237.94 .17 .17 .17
     Sum of a is = 101873376000000.000
     Sum of b is = 20374675200000.0000
     Sum of c is = 27166233600000.0000
     Incremental Offset = 2560
     Number of Threads = 4
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 67077120
     Offset = 0
     The total memory requirement is 1535 MB
     You are running each test 5 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 173082 microseconds
        (= 173082 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function Rate (MB/s) RMS time Min time Max time
    Copy: 6336.72 .17 .17 .19
    Scale: 6204.31 .17 .17 .17
    Add: 9038.50 .18 .18 .18
    Triad: 9235.56 .17 .17 .17
     Sum of a is = 101873376000000.000
     Sum of b is = 20374675200000.0000
     Sum of c is = 27166233600000.0000
     Incremental Offset = 512
     Number of Threads = 4
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 67075072
     Offset = 0
     The total memory requirement is 1535 MB
     You are running each test 5 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 173082 microseconds
        (= 173082 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function Rate (MB/s) RMS time Min time Max time
    Copy: 6334.12 .17 .17 .19
    Scale: 6202.30 .17 .17 .17
    Add: 9043.60 .18 .18 .18
    Triad: 9253.06 .17 .17 .18
     Sum of a is = 101870265600000.000
     Sum of b is = 20374053120000.0000
     Sum of c is = 27165404160000.0000
     Incremental Offset = 1536
     Number of Threads = 4
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 67075072
     Offset = 0
     The total memory requirement is 1535 MB
     You are running each test 5 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 173061 microseconds
        (= 173061 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function Rate (MB/s) RMS time Min time Max time
    Copy: 6333.75 .17 .17 .19
    Scale: 6203.76 .17 .17 .17
    Add: 9043.13 .18 .18 .18
    Triad: 9247.27 .17 .17 .17
     Sum of a is = 101870265600000.000
     Sum of b is = 20374053120000.0000
     Sum of c is = 27165404160000.0000
     Incremental Offset = 2560
     Number of Threads = 4
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 67075072
     Offset = 0
     The total memory requirement is 1535 MB
     You are running each test 5 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 173057 microseconds
        (= 173057 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function Rate (MB/s) RMS time Min time Max time
    Copy: 6334.65 .17 .17 .19
    Scale: 6203.77 .17 .17 .17
    Add: 9041.13 .18 .18 .18
    Triad: 9241.83 .17 .17 .17
     Sum of a is = 101870265600000.000
     Sum of b is = 20374053120000.0000
     Sum of c is = 27165404160000.0000
     Incremental Offset = 512
     Number of Threads = 4
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 67073024
     Offset = 0
     The total memory requirement is 1535 MB
     You are running each test 5 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 173041 microseconds
        (= 173041 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function Rate (MB/s) RMS time Min time Max time
    Copy: 6334.66 .17 .17 .19
    Scale: 6202.59 .17 .17 .17
    Add: 9042.47 .18 .18 .18
    Triad: 9261.63 .17 .17 .17
     Sum of a is = 101867155200000.000
     Sum of b is = 20373431040000.0000
     Sum of c is = 27164574720000.0000
     Incremental Offset = 1536
     Number of Threads = 4
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 67073024
     Offset = 0
     The total memory requirement is 1535 MB
     You are running each test 5 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 173152 microseconds
        (= 173152 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function Rate (MB/s) RMS time Min time Max time
    Copy: 6334.58 .17 .17 .19
    Scale: 6204.01 .17 .17 .17
    Add: 9043.97 .18 .18 .18
    Triad: 9243.32 .17 .17 .17
     Sum of a is = 101867155200000.000
     Sum of b is = 20373431040000.0000
     Sum of c is = 27164574720000.0000
     Incremental Offset = 2560
     Number of Threads = 4
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 67073024
     Offset = 0
     The total memory requirement is 1535 MB
     You are running each test 5 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 172962 microseconds
        (= 172962 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function Rate (MB/s) RMS time Min time Max time
    Copy: 6337.22 .17 .17 .19
    Scale: 6201.97 .17 .17 .17
    Add: 9035.72 .18 .18 .18
    Triad: 9238.51 .17 .17 .17
     Sum of a is = 101867155200000.000
     Sum of b is = 20373431040000.0000
     Sum of c is = 27164574720000.0000
    GETSHRSEG: requesting large pages
    GETSHRSEG ENTRY: shmgetflag -2147481216
    bindprocessor successful: thread_self() 659609 cpu_id 0
    bindprocessor successful: thread_self() 659609 cpu_id 2
    GETSHRSEG: requesting large pages
    GETSHRSEG ENTRY: shmgetflag -2147481216
    bindprocessor successful: thread_self() 659609 cpu_id 0
    bindprocessor successful: thread_self() 659609 cpu_id 2
    GETSHRSEG: requesting large pages
    GETSHRSEG ENTRY: shmgetflag -2147481216
    bindprocessor successful: thread_self() 659609 cpu_id 0
    bindprocessor successful: thread_self() 659609 cpu_id 2
    bindprocessor successful: thread_self() 803005 cpu_id 2
    bindprocessor successful: thread_self() 815321 cpu_id 3
    bindprocessor successful: thread_self() 786609 cpu_id 1
    bindprocessor successful: thread_self() 659609 cpu_id 0
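    The "Sum of a/b/c" lines serve as a correctness check, and the memory
    line follows from the array size. The sketch below assumes the reference
    STREAM scheme: every element of a is 2.0 when timing starts (reference
    STREAM doubles a during its timer-granularity test), scalar = 3.0, and
    the 5 passes reported above each run Copy, Scale, Add, Triad in order.
    It also assumes the memory line uses 1 MB = 2^20 bytes. The helper names
    are hypothetical.

```python
# Hypothetical re-derivation of the "Sum of a/b/c" and memory lines.
# ASSUMPTIONS: a(j) = 2.0 for all j at the start of the timed loop,
# scalar = 3.0, NTIMES = 5, kernels run as Copy, Scale, Add, Triad.
NTIMES = 5
SCALAR = 3.0
BYTES_PER_WORD = 8


def expected_element_values(ntimes: int = NTIMES, a0: float = 2.0):
    """Value every element of a, b, c holds after the timed loop."""
    a, b, c = a0, 0.0, 0.0  # b and c are overwritten before first use
    for _ in range(ntimes):
        c = a               # Copy
        b = SCALAR * c      # Scale
        c = a + b           # Add
        a = b + SCALAR * c  # Triad
    return a, b, c


def expected_sums(n_words: int):
    """Sums of a, b, c; the values are integers below 2**53, so exact."""
    a, b, c = expected_element_values()
    return a * n_words, b * n_words, c * n_words


def total_memory_mb(n_words: int) -> int:
    """Three arrays of 8-byte words, assuming 1 MB = 2**20 bytes here."""
    return 3 * BYTES_PER_WORD * n_words // 2**20


# Array sizes of the four groups of runs in the output above.
for n in (67079168, 67077120, 67075072, 67073024):
    sa, sb, sc = expected_sums(n)
    print(f"N = {n}: sum a = {sa:.1f}, sum b = {sb:.1f}, "
          f"sum c = {sc:.1f}, memory = {total_memory_mb(n)} MB")
```

    Under these assumptions each pass multiplies a by 15 (a <- 3a + 3(a + 3a)),
    so after 5 passes every element holds a = 2 * 15^5 = 1518750,
    b = 303750 and c = 405000. Multiplied by the array size, these reproduce
    the logged sums for all four array sizes, and the memory requirement
    comes out at 1535 MB in each case.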



    This archive was generated by hypermail 2.1.4 : Tue Oct 05 2004 - 07:49:38 CDT