STREAM results for AthlonMP 1.2GHz

From: C. Evangelinos (ce107@cfm.brown.edu)
Date: Thu Oct 25 2001 - 15:53:14 CDT


    Hi John - I hope you're well. Here are STREAM results from a dual
    Athlon machine at my old Brown haunts:

    Dual AthlonMP 1.2GHz with 896MB of visible DDR memory running Linux 2.4.9 SMP:
    processor : 0
    vendor_id : AuthenticAMD
    cpu family : 6
    model : 6
    model name : AMD Athlon(tm) Processor
    stepping : 1
    cpu MHz : 1194.693
    cache size : 256 KB
    fdiv_bug : no
    hlt_bug : no
    f00f_bug : no
    coma_bug : no
    fpu : yes
    fpu_exception : yes
    cpuid level : 1
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
    bogomips : 2385.51

    PGI 3.2.4 compilers

    Serial Fortran 5.0 code using etime timer
    pgf77 -fast -tp athlon -Minline -Mvect=assoc,prefetch,cachesize:327680 -Mcache_align stream_d.f second_cpu.f

    With really large arrays:

    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 24000000
     Offset = 0
     The total memory requirement is 549 MB
     You are running each test 10 times
     --
     The *best* time for each test is used
     *EXCLUDING* the first and last iterations
     ----------------------------------------------------
     Your clock granularity/precision appears to be 10000 microseconds
     ----------------------------------------------------
    Function Rate (MB/s) Avg time Min time Max time
    Copy: 619.3549 0.6237 0.6200 0.6300
    Scale: 698.1818 0.5512 0.5500 0.5600
    Add: 757.8948 0.7687 0.7600 0.7800
    Triad: 757.8948 0.7687 0.7600 0.7700
     ----------------------------------------------------
     Solution Validates!
     ----------------------------------------------------
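The reported rates can be reproduced from the array size and the minimum times. The sketch below assumes the usual STREAM byte counts (Copy and Scale touch two arrays per iteration, Add and Triad touch three) and that 1 MB means 10**6 bytes in the Rate column; the function names are illustrative and not part of stream_d.f. The "549 MB" memory figure is consistent with dividing the footprint of three arrays by 2**20 instead.

```python
# Sketch of the arithmetic behind the STREAM report; names are illustrative.
BYTES_PER_WORD = 8  # "Assuming 8 bytes per DOUBLE PRECISION word"

# Arrays touched per iteration by each kernel (reads plus writes); assumed
# from the standard STREAM definitions, not stated in the post itself.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def stream_rate_mb_s(kernel, n, min_time_s):
    """Rate in MB/s (1 MB = 10**6 bytes) from the best (minimum) time."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_WORD * n
    return bytes_moved / min_time_s / 1.0e6


def memory_requirement_mib(n, n_arrays=3):
    """Footprint of the work arrays in units of 2**20 bytes."""
    return n_arrays * BYTES_PER_WORD * n / 2**20


if __name__ == "__main__":
    n = 24_000_000
    for kernel, t in [("Copy", 0.62), ("Scale", 0.55),
                      ("Add", 0.76), ("Triad", 0.76)]:
        print(f"{kernel:6s} {stream_rate_mb_s(kernel, n, t):10.4f}")
    print(f"memory {memory_requirement_mib(n):.1f}")
```

Feeding in the minimum times from the table above gives back the Rate column, which is a quick way to check that a pasted result block is internally consistent.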

    With smaller arrays (less trouble for the TLB maybe?):

    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 10000000
     Offset = 0
     The total memory requirement is 228 MB
     You are running each test 10 times
     --
     The *best* time for each test is used
     *EXCLUDING* the first and last iterations
     ----------------------------------------------------
     Your clock granularity/precision appears to be 10000 microseconds
     ----------------------------------------------------
    Function Rate (MB/s) Avg time Min time Max time
    Copy: 727.2727 0.2212 0.2200 0.2300
    Scale: 695.6522 0.2325 0.2300 0.2500
    Add: 827.5862 0.2975 0.2900 0.3000
    Triad: 827.5862 0.2950 0.2900 0.3000
     ----------------------------------------------------
     Solution Validates!
     ----------------------------------------------------
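With the etime timer the clock ticks every 10 ms, while the best times here are only 0.22 to 0.29 s, so each rate is quantized quite coarsely (Add and Triad report identical rates because both minimum times read 0.2900). The sketch below estimates how far a rate could move if the true time were off by one tick in either direction; the one-tick error bound is an assumption made for illustration, and the function name is hypothetical.

```python
# Sketch: sensitivity of a STREAM rate to the 10 ms timer granularity.
TICK_S = 0.01  # "clock granularity/precision appears to be 10000 microseconds"


def rate_band_mb_s(bytes_moved, min_time_s, tick_s=TICK_S):
    """Return (low, high) rates in MB/s, assuming the true time lies
    within one clock tick of the reported minimum time."""
    low = bytes_moved / (min_time_s + tick_s) / 1.0e6
    high = bytes_moved / (min_time_s - tick_s) / 1.0e6
    return low, high


if __name__ == "__main__":
    # Copy with N = 10,000,000: two arrays of 8-byte words per iteration.
    copy_bytes = 2 * 8 * 10_000_000
    low, high = rate_band_mb_s(copy_bytes, 0.22)
    print(f"Copy between {low:.1f} and {high:.1f} MB/s")
```

Under that assumption the Copy figure of 727 MB/s could sit anywhere from roughly 696 to 762 MB/s, so small differences between the etime-timed runs should not be over-interpreted.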

    Running 2 copies simultaneously:
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 10000000
     Offset = 0
     The total memory requirement is 228 MB
     You are running each test 10 times
     --
     The *best* time for each test is used
     *EXCLUDING* the first and last iterations
     ----------------------------------------------------
     Your clock granularity/precision appears to be 10000 microseconds
     ----------------------------------------------------
    Function Rate (MB/s) Avg time Min time Max time
    Copy: 533.3333 0.3075 0.3000 0.3200
    Scale: 516.1290 0.3287 0.3100 0.3400
    Add: 585.3659 0.4225 0.4100 0.4300
    Triad: 600.0000 0.4162 0.4000 0.4300
     ----------------------------------------------------
     Solution Validates!
     ----------------------------------------------------
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 10000000
     Offset = 0
     The total memory requirement is 228 MB
     You are running each test 10 times
     --
     The *best* time for each test is used
     *EXCLUDING* the first and last iterations
     ----------------------------------------------------
     Your clock granularity/precision appears to be 10000 microseconds
     ----------------------------------------------------
    Function Rate (MB/s) Avg time Min time Max time
    Copy: 533.3333 0.3137 0.3000 0.3200
    Scale: 516.1290 0.3312 0.3100 0.3400
    Add: 585.3659 0.4262 0.4100 0.4400
    Triad: 585.3659 0.4200 0.4100 0.4300
     ----------------------------------------------------
     Solution Validates!
     ----------------------------------------------------
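One rough way to read the two simultaneous runs is to add their rates per kernel and compare the sum with the single-copy run at the same array size. This is only a sketch: summing assumes the best iterations of both copies overlapped in time, which the output does not guarantee, so the result is best treated as an upper estimate. The dictionary and function names are illustrative.

```python
# Rates (MB/s) copied from the runs above, array size 10,000,000.
SINGLE = {"Copy": 727.2727, "Scale": 695.6522, "Add": 827.5862, "Triad": 827.5862}
COPY_A = {"Copy": 533.3333, "Scale": 516.1290, "Add": 585.3659, "Triad": 600.0000}
COPY_B = {"Copy": 533.3333, "Scale": 516.1290, "Add": 585.3659, "Triad": 585.3659}


def aggregate(a, b):
    """Per-kernel sum of the rates of two concurrently running copies."""
    return {kernel: a[kernel] + b[kernel] for kernel in a}


def ratio_to_single(agg, single):
    """Aggregate rate divided by the rate of one copy running alone."""
    return {kernel: agg[kernel] / single[kernel] for kernel in agg}


if __name__ == "__main__":
    agg = aggregate(COPY_A, COPY_B)
    for kernel, r in ratio_to_single(agg, SINGLE).items():
        print(f"{kernel:6s} {agg[kernel]:10.4f} MB/s  x{r:.2f}")
```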

    OpenMP parallel Fortran 5.0 code using gettimeofday timer
    pgcc -omp -fast -tp athlon -Minline -Mvect=assoc,prefetch,cachesize:327680 -Mcache_align -c second_wall_f.c
    pgf77 -mp -fast -tp athlon -Minline -Mvect=assoc,prefetch,cachesize:327680 -Mcache_align stream_d.f second_wall_f.o

    1 processor run
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 10000000
     Offset = 0
     The total memory requirement is 228 MB
     You are running each test 10 times
     --
     The *best* time for each test is used
     *EXCLUDING* the first and last iterations
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     ----------------------------------------------------
    Function Rate (MB/s) Avg time Min time Max time
    Copy: 726.7869 0.2203 0.2201 0.2205
    Scale: 711.7692 0.2249 0.2248 0.2251
    Add: 860.0672 0.2791 0.2790 0.2793
    Triad: 851.4927 0.2820 0.2819 0.2821
     ----------------------------------------------------
     Solution Validates!
     ----------------------------------------------------
    10.710u 0.530s 0:11.24 100.0% 0+0k 0+0io 159pf+0w

    2 processor run
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 10000000
     Offset = 0
     The total memory requirement is 228 MB
     You are running each test 10 times
     --
     The *best* time for each test is used
     *EXCLUDING* the first and last iterations
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     ----------------------------------------------------
    Function Rate (MB/s) Avg time Min time Max time
    Copy: 922.0414 0.1742 0.1735 0.1752
    Scale: 916.4276 0.1748 0.1746 0.1749
    Add: 1051.7321 0.2285 0.2282 0.2293
    Triad: 1053.3937 0.2283 0.2278 0.2299
     ----------------------------------------------------
     Solution Validates!
     ----------------------------------------------------
    16.880u 0.710s 0:08.99 195.6% 0+0k 0+0io 161pf+0w
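The timing lines after each OpenMP run show how busy the two processors were: the percentage is (user + system) CPU time divided by elapsed wall-clock time. The sketch below parses a line in the csh-style format shown above and recomputes that percentage; the helper names are hypothetical and the parser only handles the fields that appear in this post.

```python
# Sketch: recompute CPU utilization from a csh-style `time` summary line.
def parse_elapsed(text):
    """Convert 'M:SS.ss' (or 'H:MM:SS') elapsed time to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60.0 + float(part)
    return seconds


def cpu_utilization(line):
    """Return 100 * (user + system) / elapsed for one summary line."""
    fields = line.split()
    user = float(fields[0].rstrip("u"))
    system = float(fields[1].rstrip("s"))
    elapsed = parse_elapsed(fields[2])
    return 100.0 * (user + system) / elapsed


if __name__ == "__main__":
    for line in ("10.710u 0.530s 0:11.24 100.0% 0+0k 0+0io 159pf+0w",
                 "16.880u 0.710s 0:08.99 195.6% 0+0k 0+0io 161pf+0w"):
        print(f"{cpu_utilization(line):.1f}%")
```

A value close to 200% in the 2 processor run indicates that both CPUs were busy for nearly the whole run, so the limited gain in bandwidth is not explained by an idle second processor.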

    Serial Fortran (Grassl stream_offset.f) code using inline etime timer
    pgf77 -fast -tp athlon -Minline -Mvect=assoc,prefetch,cachesize:327680 -Mcache_align stream_offset.f

    With really large arrays: 677.993 634.261 797.109 776.132
    FORTRAN STOP

     STREAM benchmark
     ----------------
     Number of CPUs: 1
     Array size: 24001 Kwords
     Array padding: 20248 Words

      Aoff Boff Coff Assignment: Scaling: Summing: SAXPYing:
     -------------------------------------------------------------
         8 8 8 634.246 634.246 737.311 719.328
        16 8 8 634.246 634.246 737.311 719.328
        24 8 8 655.387 634.246 737.312 719.328
        32 8 8 655.387 634.246 737.312 719.328
         8 16 8 655.387 634.246 797.094 776.116
        16 16 8 655.387 634.246 797.094 776.116
        24 16 8 655.388 634.247 797.094 776.117
        32 16 8 655.388 634.247 797.094 776.117
         8 24 8 655.388 634.247 797.094 776.117
        16 24 8 655.388 634.247 797.094 776.117
        24 24 8 655.388 634.247 797.094 776.117
        32 24 8 655.388 634.247 797.094 776.117
         8 32 8 655.388 634.247 797.096 776.120
        16 32 8 655.388 634.247 797.096 776.120
        24 32 8 655.388 634.247 797.096 776.120
        32 32 8 655.388 634.247 797.096 776.120
         8 8 16 655.388 634.247 797.096 776.120
        16 8 16 655.388 634.248 797.096 776.120
        24 8 16 655.388 634.248 797.096 776.120
        32 8 16 655.389 634.248 797.096 776.120
         8 16 16 655.389 634.248 797.096 776.120
        16 16 16 655.389 634.248 797.096 776.120
        24 16 16 655.389 634.248 797.096 776.120
        32 16 16 655.389 634.253 797.096 776.120
         8 24 16 655.389 634.253 797.096 776.120
        16 24 16 655.389 634.253 797.100 776.124
        24 24 16 655.389 634.253 797.100 776.124
        32 24 16 655.389 634.253 797.100 776.124
         8 32 16 655.389 634.253 797.100 776.124
        16 32 16 655.389 634.253 797.100 776.124
        24 32 16 655.389 634.253 797.100 776.124
        32 32 16 655.389 634.253 797.100 776.124
         8 8 24 655.389 634.253 797.100 776.124
        16 8 24 655.389 634.253 797.100 776.124
        24 8 24 655.389 634.253 797.100 776.124
        32 8 24 655.389 634.253 797.100 776.124
         8 16 24 655.393 634.253 797.100 776.124
        16 16 24 655.393 634.253 797.100 776.124
        24 16 24 677.993 634.253 797.100 776.124
        32 16 24 677.993 634.253 797.100 776.124
         8 24 24 677.993 634.253 797.100 776.124
        16 24 24 677.993 634.253 797.100 776.124
        24 24 24 677.993 634.253 797.100 776.124
        32 24 24 677.993 634.253 797.100 776.124
         8 32 24 677.993 634.253 797.104 776.124
        16 32 24 677.993 634.253 797.104 776.124
        24 32 24 677.993 634.253 797.104 776.124
        32 32 24 677.993 634.253 797.104 776.124
         8 8 32 677.993 634.261 797.104 776.124
        16 8 32 677.993 634.261 797.109 776.132
        24 8 32 677.993 634.261 797.109 776.132
        32 8 32 677.993 634.261 797.109 776.132
         8 16 32 677.993 634.261 797.109 776.132
        16 16 32 677.993 634.261 797.109 776.132
        24 16 32 677.993 634.261 797.109 776.132
        32 16 32 677.993 634.261 797.109 776.132
         8 24 32 677.993 634.261 797.109 776.132
        16 24 32 677.993 634.261 797.109 776.132
        24 24 32 677.993 634.261 797.109 776.132
        32 24 32 677.993 634.261 797.109 776.132
         8 32 32 677.993 634.261 797.109 776.132
        16 32 32 677.993 634.261 797.109 776.132
        24 32 32 677.993 634.261 797.109 776.132
        32 32 32 677.993 634.261 797.109 776.132
    FORTRAN STOP
    184.270u 0.800s 3:05.13 99.9% 0+0k 0+0io 139pf+0w

    With smaller ones: 682.753 630.235 819.297 819.278
    FORTRAN STOP

     STREAM benchmark
     ----------------
     Number of CPUs: 1
     Array size: 10001 Kwords
     Array padding: 20248 Words

      Aoff Boff Coff Assignment: Scaling: Summing: SAXPYing:
     -------------------------------------------------------------
         8 8 8 630.217 630.217 722.896 722.896
        16 8 8 630.217 630.217 722.896 722.896
        24 8 8 682.735 630.217 722.896 722.896
        32 8 8 682.735 630.217 722.896 722.896
         8 16 8 682.735 630.217 819.281 768.077
        16 16 8 682.735 630.217 819.281 768.077
        24 16 8 682.735 630.218 819.283 768.078
        32 16 8 682.735 630.218 819.283 768.078
         8 24 8 682.735 630.219 819.283 768.078
        16 24 8 682.735 630.219 819.283 768.078
        24 24 8 682.735 630.219 819.283 768.078
        32 24 8 682.735 630.219 819.283 768.078
         8 32 8 682.735 630.219 819.283 768.078
        16 32 8 682.735 630.219 819.283 768.078
        24 32 8 682.735 630.219 819.283 768.081
        32 32 8 682.738 630.219 819.283 768.081
         8 8 16 682.738 630.219 819.283 768.081
        16 8 16 682.738 630.219 819.283 819.278
        24 8 16 682.738 630.219 819.283 819.278
        32 8 16 682.738 630.219 819.285 819.278
         8 16 16 682.738 630.219 819.285 819.278
        16 16 16 682.738 630.221 819.285 819.278
        24 16 16 682.738 630.221 819.285 819.278
        32 16 16 682.738 630.221 819.285 819.278
         8 24 16 682.738 630.221 819.285 819.278
        16 24 16 682.738 630.221 819.285 819.278
        24 24 16 682.738 630.223 819.285 819.278
        32 24 16 682.738 630.226 819.285 819.278
         8 32 16 682.738 630.226 819.285 819.278
        16 32 16 682.738 630.226 819.285 819.278
        24 32 16 682.738 630.226 819.285 819.278
        32 32 16 682.738 630.226 819.285 819.278
         8 8 24 682.738 630.226 819.285 819.278
        16 8 24 682.738 630.226 819.285 819.278
        24 8 24 682.743 630.226 819.285 819.278
        32 8 24 682.743 630.226 819.285 819.278
         8 16 24 682.743 630.226 819.285 819.278
        16 16 24 682.743 630.226 819.285 819.278
        24 16 24 682.743 630.226 819.285 819.278
        32 16 24 682.743 630.226 819.285 819.278
         8 24 24 682.743 630.226 819.285 819.278
        16 24 24 682.743 630.226 819.285 819.278
        24 24 24 682.743 630.226 819.285 819.278
        32 24 24 682.743 630.226 819.285 819.278
         8 32 24 682.743 630.226 819.291 819.278
        16 32 24 682.743 630.226 819.291 819.278
        24 32 24 682.743 630.226 819.291 819.278
        32 32 24 682.743 630.226 819.291 819.278
         8 8 32 682.743 630.226 819.291 819.278
        16 8 32 682.743 630.226 819.291 819.278
        24 8 32 682.743 630.226 819.291 819.278
        32 8 32 682.743 630.226 819.291 819.278
         8 16 32 682.743 630.226 819.291 819.278
        16 16 32 682.743 630.228 819.291 819.278
        24 16 32 682.743 630.228 819.291 819.278
        32 16 32 682.743 630.228 819.291 819.278
         8 24 32 682.743 630.228 819.291 819.278
        16 24 32 682.743 630.228 819.291 819.278
        24 24 32 682.743 630.232 819.291 819.278
        32 24 32 682.743 630.235 819.297 819.278
         8 32 32 682.753 630.235 819.297 819.278
        16 32 32 682.753 630.235 819.297 819.278
        24 32 32 682.753 630.235 819.297 819.278
        32 32 32 682.753 630.235 819.297 819.278
    76.890u 0.260s 1:17.15 100.0% 0+0k 0+0io 141pf+0w
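In both offset sweeps the one-line summary ("With really large arrays: ..." and "With smaller ones: ...") equals the column-wise maximum of the table that follows it. The values also never decrease down a column, which suggests, though the post does not say so, that each entry is a running best. The sketch below extracts the column maxima from pasted table text; the function name is hypothetical and the sample is a five-row excerpt of the smaller run, not the full table.

```python
# Sketch: column-wise maxima of a stream_offset.f style table.
def column_maxima(table_text):
    """Return the maximum of each of the four rate columns.

    Lines that do not consist of three offsets followed by four
    numeric rates (headers, rules, blank lines) are skipped.
    """
    rows = []
    for line in table_text.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue
        try:
            rows.append([float(value) for value in fields[3:]])
        except ValueError:
            continue
    return [max(column) for column in zip(*rows)]


EXCERPT = """
  Aoff Boff Coff Assignment: Scaling: Summing: SAXPYing:
 -------------------------------------------------------------
     8  8  8 630.217 630.217 722.896 722.896
    24  8  8 682.735 630.217 722.896 722.896
     8 16  8 682.735 630.217 819.281 768.077
    16  8 16 682.738 630.219 819.283 819.278
    32 32 32 682.753 630.235 819.297 819.278
"""

if __name__ == "__main__":
    print(column_maxima(EXCERPT))
```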

    I did not spend time testing the stream_d.c and stream_l.cpp
    versions of the benchmark, or the GNU compilers, since in my
    experience they are slower on STREAM.

    I cannot get any closer to the numbers you have listed for
    single-processor Athlons at 1.2GHz (same AMD 760 chipset), even
    though those were obtained on a Tbird Athlon, which lacks the
    prefetch engine that the Athlon4/MP/XP has. Either the Compaq Visual
    Fortran compiler produces better (prefetching?) code, or the memory
    configuration is different even though both systems use DDR.

    At least the achievable bandwidth increases somewhat going from 1 to
    2 processors (about 922 vs. 727 MB/s for Copy in the OpenMP runs).
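To put a number on "somewhat", the sketch below divides the 2 processor OpenMP rates by the 1 processor rates reported earlier in this message. The parallel-efficiency figure (speedup divided by the processor count) is an added convenience, and the names are illustrative.

```python
# Rates (MB/s) copied from the OpenMP runs above, array size 10,000,000.
ONE_PROC = {"Copy": 726.7869, "Scale": 711.7692, "Add": 860.0672, "Triad": 851.4927}
TWO_PROC = {"Copy": 922.0414, "Scale": 916.4276, "Add": 1051.7321, "Triad": 1053.3937}


def speedup(one, two):
    """Per-kernel ratio of 2 processor bandwidth to 1 processor bandwidth."""
    return {kernel: two[kernel] / one[kernel] for kernel in one}


if __name__ == "__main__":
    for kernel, s in speedup(ONE_PROC, TWO_PROC).items():
        print(f"{kernel:6s} speedup {s:.2f}  efficiency {100 * s / 2:.0f}%")
```

All four kernels come out between 1.2x and 1.3x, which is consistent with both processors contending for the same memory system.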

    Dr. Constantinos Evangelinos
    Ocean Engineering
    MIT


