Dell R630 with 2 Xeon E5-2660 v3 (10 core, 2.6 GHz, 105W) & 64 GiB of DDR4/2133 (one dual-rank 16 GiB DIMM per channel).
Compiled with icc 2015: -O3 -xCORE-AVX2 -ffreestanding -openmp -DSTREAM_ARRAY_SIZE=400000000
Summary:
Copy 109124 MB/s
Scale 109634 MB/s
Add 111862 MB/s
Triad 111760 MB/s
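For context, the summary figures can be set against the theoretical peak of the memory subsystem. The sketch below is a rough estimate only: it assumes four DDR4 channels per socket, all populated (the channel count is not stated in this post; it is the documented figure for the E5-2600 v3 family), and 8 bytes per transfer per channel. All names in the code are illustrative.

```python
# Rough peak-bandwidth comparison for the summary numbers.
# Assumptions (not stated in the post): 4 DDR4 channels per socket,
# all populated, 8 bytes (64 bits) moved per transfer per channel.

SOCKETS = 2
CHANNELS_PER_SOCKET = 4       # assumed; E5-2600 v3 family
TRANSFERS_PER_SEC = 2133e6    # DDR4/2133, nominal transfer rate
BYTES_PER_TRANSFER = 8

# Summary values from the post, in MB/s (1 MB = 1e6 bytes, as STREAM reports).
SUMMARY_MB_S = {"Copy": 109124, "Scale": 109634, "Add": 111862, "Triad": 111760}


def peak_mb_per_s():
    """Theoretical peak DRAM bandwidth in MB/s under the assumptions above."""
    bytes_per_sec = (SOCKETS * CHANNELS_PER_SOCKET
                     * TRANSFERS_PER_SEC * BYTES_PER_TRANSFER)
    return bytes_per_sec / 1e6


def efficiency(rates_mb_s):
    """Fraction of the assumed peak achieved by each kernel."""
    peak = peak_mb_per_s()
    return {kernel: rate / peak for kernel, rate in rates_mb_s.items()}


if __name__ == "__main__":
    print(f"assumed peak: {peak_mb_per_s():.0f} MB/s")
    for kernel, frac in efficiency(SUMMARY_MB_S).items():
        print(f"{kernel:6s} {frac:.1%} of assumed peak")
```

Under these assumptions the four kernels land at roughly 80-82% of peak.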
Notes:
* The Uncore Frequency was set to "Maximum" in the BIOS.
* C states are enabled.
* "Energy Efficient Turbo" was disabled.
    * This mode limits the maximum Turbo frequency based on a "return on investment" CPI measurement.
    * For STREAM, "Energy Efficient Turbo" limits core frequencies to 2.6 GHz (vs 2.9 GHz) and reduces performance by slightly less than 1%.
* Not all combinations of BIOS settings were tested -- I picked settings to maximize memory performance, control, and reproducibility, with much less concern for energy efficiency.
* Turbo mode was enabled -- the cores ran at 2.9 GHz (max all-core Turbo frequency when using 256-bit registers)
* Alternate Configurations:
    * Performance was slightly higher (~1%) when using 10-12 cores instead of all 20.
    * Performance was only ~1.7% lower when using 8 cores (4 per socket). This case gave the best energy efficiency.
    * When using all cores, performance was almost completely independent of frequency -- Triad results were only reduced by 1% at the minimum supported frequency of 1.2 GHz.
    * When using fewer than all cores:
        * 8-core (4 per socket) performance was somewhat dependent on frequency -- Triad results dropped by ~15% at 1.2 GHz.
        * 12-core (6 per socket) performance was very weakly dependent on frequency -- Triad results dropped by less than 3% at 1.2 GHz (within 2% of 20-core results at 2.9 GHz).
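One way to read the frequency notes is in terms of bytes moved per core per clock cycle. The sketch below uses only figures quoted above (Triad at 111760 MB/s on 20 cores at 2.9 GHz, about 1% less at 1.2 GHz; about 1.7% less on 8 cores, with a further ~15% drop at 1.2 GHz). It assumes the ~1.7% figure is relative to the 20-core Triad result and the ~15% figure is relative to the 8-core result at 2.9 GHz, and it treats the approximate percentages as exact, so the outputs are rough estimates rather than measurements. The function name is illustrative.

```python
# Estimated STREAM-counted bytes per core per cycle for the cases in the notes.
# The percentages are the approximate values quoted in the post, treated as exact.

TRIAD_20C_MB_S = 111760.0  # summary Triad result, 20 cores at 2.9 GHz


def bytes_per_cycle_per_core(mb_per_s, cores, ghz):
    """Average bytes moved per core per clock cycle (1 MB = 1e6 bytes)."""
    return (mb_per_s * 1e6) / (cores * ghz * 1e9)


# (label, estimated Triad rate in MB/s, active cores, core frequency in GHz)
CASES = [
    ("20 cores @ 2.9 GHz", TRIAD_20C_MB_S, 20, 2.9),
    ("20 cores @ 1.2 GHz", TRIAD_20C_MB_S * 0.99, 20, 1.2),
    (" 8 cores @ 2.9 GHz", TRIAD_20C_MB_S * (1 - 0.017), 8, 2.9),
    (" 8 cores @ 1.2 GHz", TRIAD_20C_MB_S * (1 - 0.017) * 0.85, 8, 1.2),
]

for label, rate, cores, ghz in CASES:
    bpc = bytes_per_cycle_per_core(rate, cores, ghz)
    print(f"{label}: {rate:9.0f} MB/s, {bpc:5.2f} bytes/cycle/core")
```

Under these assumptions each core sustains roughly 2 bytes/cycle with all 20 cores at 2.9 GHz, but closer to 10 bytes/cycle with 8 cores at 1.2 GHz, which is consistent with the fewer-core cases being more sensitive to frequency.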
Details:
~/WorkSpace/SystemMirrors/Discovery2/STREAM/Results/2015-04-22/DDR4_2133:2015-10-13T15:30:46 $ more log.scatter.2601000.20p.AVX2
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19}
OMP: Info #156: KMP_AFFINITY: 20 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 10 cores/pkg x 1 threads/core (20 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 1
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 3
OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 4
OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 8
OMP: Info #171: KMP_AFFINITY: OS proc 12 maps to package 0 core 9
OMP: Info #171: KMP_AFFINITY: OS proc 14 maps to package 0 core 10
OMP: Info #171: KMP_AFFINITY: OS proc 16 maps to package 0 core 11
OMP: Info #171: KMP_AFFINITY: OS proc 18 maps to package 0 core 12
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 1 core 0
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 1 core 1
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 1 core 2
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 1 core 3
OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 1 core 4
OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 1 core 8
OMP: Info #171: KMP_AFFINITY: OS proc 13 maps to package 1 core 9
OMP: Info #171: KMP_AFFINITY: OS proc 15 maps to package 1 core 10
OMP: Info #171: KMP_AFFINITY: OS proc 17 maps to package 1 core 11
OMP: Info #171: KMP_AFFINITY: OS proc 19 maps to package 1 core 12
OMP: Info #242: KMP_AFFINITY: pid 18881 thread 0 bound to OS proc set {0}
OMP: Info #242: KMP_AFFINITY: pid 18881 thread 1 bound to OS proc set {1}
OMP: Info #242: KMP_AFFINITY: pid 18881 thread 2 bound to OS proc set {2}
OMP: Info #242: KMP_AFFINITY: pid 18881 thread 3 bound to OS proc set {3}
OMP: Info #242: KMP_AFFINITY: pid 18881 thread 4 bound to OS proc set {4}
OMP: Info #242: KMP_AFFINITY: pid 18881 thread 5 bound to OS proc set {5}
OMP: Info #242: KMP_AFFINITY: pid 18881 thread 7 bound to OS proc set {7}
OMP: Info #242: KMP_AFFINITY: pid 18881 thread 6 bound to OS proc set {6}
OMP: Info #242: KMP_AFFINITY: pid 18881 thread 10 bound to OS proc set {10}
OMP: Info #242: KMP_AFFINITY: pid 18881 thread 9 bound to OS proc set {9}
OMP: Info #242: KMP_AFFINITY: pid 18881 thread 8 bound to OS proc set {8}
OMP: Info #242: KMP_AFFINITY: pid 18881 thread 11 bound to OS proc set {11}
OMP: Info #242: KMP_AFFINITY: pid 18881 thread 13 bound to OS proc set {13}
OMP: Info #242: KMP_AFFINITY: pid 18881 thread 12 bound to OS proc set {12}
OMP: Info #242: KMP_AFFINITY: pid 18881 thread 14 bound to OS proc set {14}
OMP: Info #242: KMP_AFFINITY: pid 18881 thread 15 bound to OS proc set {15}
OMP: Info #242: KMP_AFFINITY: pid 18881 thread 16 bound to OS proc set {16}
OMP: Info #242: KMP_AFFINITY: pid 18881 thread 17 bound to OS proc set {17}
OMP: Info #242: KMP_AFFINITY: pid 18881 thread 18 bound to OS proc set {18}
OMP: Info #242: KMP_AFFINITY: pid 18881 thread 19 bound to OS proc set {19}
-------------------------------------------------------------
STREAM version $Revision: 1.4 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 400000000, Offset = 0
Total memory required = 9155.3 MiB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 20
Number of Threads counted = 20
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 63379 microseconds.
(= 63379 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:         109099.2599     0.0588     0.0587     0.0597
Scale:        109233.7782     0.0587     0.0586     0.0591
Add:          111869.8591     0.0859     0.0858     0.0859
Triad:        111683.9932     0.0860     0.0860     0.0861
-------------------------------------------------------------
Solution Validates: avg error less than 1e-15 on all three arrays
-------------------------------------------------------------
Performance counter stats for './stream.omp.AVX2':
128298.046218 task-clock # 18.909 CPUs utilized
846 context-switches # 0.007 K/sec
35 cpu-migrations # 0.000 K/sec
6,369 page-faults # 0.050 K/sec
371,606,234,043 cycles # 2.896 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
77,311,273,853 instructions # 0.21 insns per cycle
14,561,782,675 branches # 113.500 M/sec
2,191,532 branch-misses # 0.02% of all branches
6.784942866 seconds time elapsed
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19}
OMP: Info #156: KMP_AFFINITY: 20 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 10 cores/pkg x 1 threads/core (20 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 1
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 3
OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 4
OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 8
OMP: Info #171: KMP_AFFINITY: OS proc 12 maps to package 0 core 9
OMP: Info #171: KMP_AFFINITY: OS proc 14 maps to package 0 core 10
OMP: Info #171: KMP_AFFINITY: OS proc 16 maps to package 0 core 11
OMP: Info #171: KMP_AFFINITY: OS proc 18 maps to package 0 core 12
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 1 core 0
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 1 core 1
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 1 core 2
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 1 core 3
OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 1 core 4
OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 1 core 8
OMP: Info #171: KMP_AFFINITY: OS proc 13 maps to package 1 core 9
OMP: Info #171: KMP_AFFINITY: OS proc 15 maps to package 1 core 10
OMP: Info #171: KMP_AFFINITY: OS proc 17 maps to package 1 core 11
OMP: Info #171: KMP_AFFINITY: OS proc 19 maps to package 1 core 12
OMP: Info #242: KMP_AFFINITY: pid 18903 thread 0 bound to OS proc set {0}
OMP: Info #242: KMP_AFFINITY: pid 18903 thread 1 bound to OS proc set {1}
OMP: Info #242: KMP_AFFINITY: pid 18903 thread 3 bound to OS proc set {3}
OMP: Info #242: KMP_AFFINITY: pid 18903 thread 2 bound to OS proc set {2}
OMP: Info #242: KMP_AFFINITY: pid 18903 thread 4 bound to OS proc set {4}
OMP: Info #242: KMP_AFFINITY: pid 18903 thread 5 bound to OS proc set {5}
OMP: Info #242: KMP_AFFINITY: pid 18903 thread 7 bound to OS proc set {7}
OMP: Info #242: KMP_AFFINITY: pid 18903 thread 6 bound to OS proc set {6}
OMP: Info #242: KMP_AFFINITY: pid 18903 thread 8 bound to OS proc set {8}
OMP: Info #242: KMP_AFFINITY: pid 18903 thread 9 bound to OS proc set {9}
OMP: Info #242: KMP_AFFINITY: pid 18903 thread 10 bound to OS proc set {10}
OMP: Info #242: KMP_AFFINITY: pid 18903 thread 11 bound to OS proc set {11}
OMP: Info #242: KMP_AFFINITY: pid 18903 thread 12 bound to OS proc set {12}
OMP: Info #242: KMP_AFFINITY: pid 18903 thread 13 bound to OS proc set {13}
OMP: Info #242: KMP_AFFINITY: pid 18903 thread 14 bound to OS proc set {14}
OMP: Info #242: KMP_AFFINITY: pid 18903 thread 15 bound to OS proc set {15}
OMP: Info #242: KMP_AFFINITY: pid 18903 thread 17 bound to OS proc set {17}
OMP: Info #242: KMP_AFFINITY: pid 18903 thread 16 bound to OS proc set {16}
OMP: Info #242: KMP_AFFINITY: pid 18903 thread 18 bound to OS proc set {18}
OMP: Info #242: KMP_AFFINITY: pid 18903 thread 19 bound to OS proc set {19}
-------------------------------------------------------------
STREAM version $Revision: 1.4 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 400000000, Offset = 0
Total memory required = 9155.3 MiB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 20
Number of Threads counted = 20
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 63250 microseconds.
(= 63250 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:         109123.6528     0.0588     0.0586     0.0589
Scale:        109633.9575     0.0585     0.0584     0.0586
Add:          111862.0894     0.0859     0.0858     0.0860
Triad:        111760.2507     0.0860     0.0859     0.0861
-------------------------------------------------------------
Solution Validates: avg error less than 1e-15 on all three arrays
-------------------------------------------------------------
Performance counter stats for './stream.omp.AVX2':
128286.180255 task-clock # 18.873 CPUs utilized
734 context-switches # 0.006 K/sec
27 cpu-migrations # 0.000 K/sec
6,355 page-faults # 0.050 K/sec
371,867,953,614 cycles # 2.899 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
77,561,980,423 instructions # 0.21 insns per cycle
14,638,892,990 branches # 114.111 M/sec
2,146,510 branch-misses # 0.01% of all branches
6.797216182 seconds time elapsed
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19}
OMP: Info #156: KMP_AFFINITY: 20 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 10 cores/pkg x 1 threads/core (20 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 1
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 3
OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 4
OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 8
OMP: Info #171: KMP_AFFINITY: OS proc 12 maps to package 0 core 9
OMP: Info #171: KMP_AFFINITY: OS proc 14 maps to package 0 core 10
OMP: Info #171: KMP_AFFINITY: OS proc 16 maps to package 0 core 11
OMP: Info #171: KMP_AFFINITY: OS proc 18 maps to package 0 core 12
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 1 core 0
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 1 core 1
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 1 core 2
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 1 core 3
OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 1 core 4
OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 1 core 8
OMP: Info #171: KMP_AFFINITY: OS proc 13 maps to package 1 core 9
OMP: Info #171: KMP_AFFINITY: OS proc 15 maps to package 1 core 10
OMP: Info #171: KMP_AFFINITY: OS proc 17 maps to package 1 core 11
OMP: Info #171: KMP_AFFINITY: OS proc 19 maps to package 1 core 12
OMP: Info #242: KMP_AFFINITY: pid 18925 thread 0 bound to OS proc set {0}
OMP: Info #242: KMP_AFFINITY: pid 18925 thread 1 bound to OS proc set {1}
OMP: Info #242: KMP_AFFINITY: pid 18925 thread 2 bound to OS proc set {2}
OMP: Info #242: KMP_AFFINITY: pid 18925 thread 3 bound to OS proc set {3}
OMP: Info #242: KMP_AFFINITY: pid 18925 thread 4 bound to OS proc set {4}
OMP: Info #242: KMP_AFFINITY: pid 18925 thread 6 bound to OS proc set {6}
OMP: Info #242: KMP_AFFINITY: pid 18925 thread 5 bound to OS proc set {5}
OMP: Info #242: KMP_AFFINITY: pid 18925 thread 8 bound to OS proc set {8}
OMP: Info #242: KMP_AFFINITY: pid 18925 thread 7 bound to OS proc set {7}
OMP: Info #242: KMP_AFFINITY: pid 18925 thread 9 bound to OS proc set {9}
OMP: Info #242: KMP_AFFINITY: pid 18925 thread 10 bound to OS proc set {10}
OMP: Info #242: KMP_AFFINITY: pid 18925 thread 11 bound to OS proc set {11}
OMP: Info #242: KMP_AFFINITY: pid 18925 thread 12 bound to OS proc set {12}
OMP: Info #242: KMP_AFFINITY: pid 18925 thread 13 bound to OS proc set {13}
OMP: Info #242: KMP_AFFINITY: pid 18925 thread 14 bound to OS proc set {14}
OMP: Info #242: KMP_AFFINITY: pid 18925 thread 15 bound to OS proc set {15}
OMP: Info #242: KMP_AFFINITY: pid 18925 thread 17 bound to OS proc set {17}
OMP: Info #242: KMP_AFFINITY: pid 18925 thread 18 bound to OS proc set {18}
OMP: Info #242: KMP_AFFINITY: pid 18925 thread 16 bound to OS proc set {16}
OMP: Info #242: KMP_AFFINITY: pid 18925 thread 19 bound to OS proc set {19}
-------------------------------------------------------------
STREAM version $Revision: 1.4 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 400000000, Offset = 0
Total memory required = 9155.3 MiB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 20
Number of Threads counted = 20
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 63374 microseconds.
(= 63374 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:         109216.8898     0.0587     0.0586     0.0588
Scale:        109468.9808     0.0585     0.0585     0.0586
Add:          112044.8076     0.0857     0.0857     0.0859
Triad:        111813.6308     0.0859     0.0859     0.0860
-------------------------------------------------------------
Solution Validates: avg error less than 1e-15 on all three arrays
-------------------------------------------------------------
Performance counter stats for './stream.omp.AVX2':
128260.580114 task-clock # 18.900 CPUs utilized
720 context-switches # 0.006 K/sec
28 cpu-migrations # 0.000 K/sec
6,319 page-faults # 0.049 K/sec
371,839,666,252 cycles # 2.899 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
77,729,679,802 instructions # 0.21 insns per cycle
14,683,151,696 branches # 114.479 M/sec
2,179,317 branch-misses # 0.01% of all branches
6.786395421 seconds time elapsed
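The derived values in the logs can be cross-checked from the raw numbers. The sketch below recomputes the perf columns (CPUs utilized, GHz, insns per cycle) from the counters of the first run, and recomputes STREAM rates from the array size and minimum times, using STREAM's convention of counting two arrays for Copy and Scale and three for Add and Triad. The helper names are illustrative, and because the log prints times to only four decimal places, the recomputed rates match the reported ones only approximately.

```python
# Cross-check of derived values in the first run's log.

def perf_derived(task_clock_ms, cycles, instructions, elapsed_s):
    """Recompute the derived columns that perf stat prints next to the counters."""
    cpu_seconds = task_clock_ms / 1e3          # task-clock is reported in ms
    return {
        "cpus_utilized": cpu_seconds / elapsed_s,
        "ghz": cycles / cpu_seconds / 1e9,
        "ipc": instructions / cycles,
    }


def stream_rate_mb_s(arrays_touched, n_elements, min_time_s, bytes_per_element=8):
    """STREAM rate in MB/s (1 MB = 1e6 bytes) from the best (minimum) time."""
    return arrays_touched * bytes_per_element * n_elements / min_time_s / 1e6


if __name__ == "__main__":
    # Counters from the first run (pid 18881).
    d = perf_derived(128298.046218, 371_606_234_043, 77_311_273_853, 6.784942866)
    print(f"CPUs utilized: {d['cpus_utilized']:.3f}")
    print(f"GHz:           {d['ghz']:.3f}")
    print(f"insns/cycle:   {d['ipc']:.2f}")

    # Array size and first-run minimum times from the log.
    n = 400_000_000
    print(f"Copy  (recomputed): {stream_rate_mb_s(2, n, 0.0587):.0f} MB/s")
    print(f"Triad (recomputed): {stream_rate_mb_s(3, n, 0.0860):.0f} MB/s")
```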
--
John D. McCalpin, Ph.D.
Texas Advanced Computing Center
University of Texas at Austin
https://www.tacc.utexas.edu/about/directory/john-mccalpin
Received on Thu Oct 15 2015 - 15:03:29 CDT