Results from an older system (released November 2011): a Dell PowerEdge R815 with four 8-core AMD Opteron 6220 (Interlagos/Bulldozer) processors and 128 GiB of DDR3/1333 DRAM on 16 channels.
These results were obtained with the Intel 12.1 compiler, but the code was hacked to explicitly call the Linux "sched_setaffinity()" function to bind the OpenMP threads to the desired cores.
This makes the results "Experimental", though it should be possible to obtain the same performance with a compiler that supports streaming stores and does not refuse to apply thread binding when running on an AMD processor. Either the PGI or the Open64 compiler should work.
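For readers who have not used this call, here is a minimal sketch of that kind of binding. It is not the code used for these runs; the helper names bind_self_to_cpu() and first_allowed_cpu() are hypothetical, and only sched_setaffinity(), sched_getaffinity() and the CPU_* macros come from glibc.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* glibc exposes the CPU_* macros and sched_setaffinity() only with this */
#endif
#include <sched.h>

/* Hypothetical helper (not from the original runs): pin the calling
 * thread to one logical CPU. Returns 0 on success, -1 on error. */
int bind_self_to_cpu(int cpu)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    /* A pid of 0 refers to the calling thread. */
    return sched_setaffinity(0, sizeof(mask), &mask);
}

/* Hypothetical helper: lowest-numbered CPU the calling thread is
 * currently allowed to run on, or -1 if the query fails. */
int first_allowed_cpu(void)
{
    cpu_set_t mask;
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &mask))
            return cpu;
    return -1;
}
```

In an OpenMP code, each thread would call something like bind_self_to_cpu() once at the top of a parallel region, before the arrays are first touched, so that first-touch allocation places each page on the thread's own node. Which CPU number each thread should get depends on how the kernel numbers the cores on a given system.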
The code was compiled with "icc -O2 -xSSE2 -openmp -DSTREAM_ARRAY_SIZE=80000000".
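As a quick check on what that array size means for memory, the sketch below computes the total footprint of the three STREAM arrays. It assumes the default 8-byte double element type; the function name stream_footprint_bytes() is made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: bytes occupied by STREAM's three arrays (a, b, c),
 * given the number of elements per array and the size of one element. */
unsigned long long stream_footprint_bytes(unsigned long long n, size_t elem_size)
{
    return 3ULL * n * (unsigned long long)elem_size;
}
```

With n = 80000000 and sizeof(double) = 8 this gives 1,920,000,000 bytes, or roughly 1.8 GiB in total across the three arrays.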
No "numactl" was needed at runtime, since the desired local allocation is the system default behavior.
Each socket contains two chips, each constituting one NUMA node. Each chip has 4 cores (2 Bulldozer core pairs).
Results are reported for 1 through 8 NUMA nodes (4, 8, 12, ..., 32 threads), with each group of four threads bound to the four cores of a single chip to show node-by-node scaling.
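The thread placement described above can be sketched as a small mapping from thread number to NUMA node and CPU. This is an illustration, not the code used for the runs. It assumes the four cores of each node have consecutive CPU numbers, and the caller supplies the first CPU number of each node, since the kernel's numbering varies between systems. All names are hypothetical.

```c
enum { CORES_PER_NODE = 4 };   /* four cores per chip, one chip per NUMA node */

/* NUMA node that thread `tid` belongs to when threads fill nodes in order. */
int node_of_thread(int tid)
{
    return tid / CORES_PER_NODE;
}

/* CPU number for thread `tid`. node_first_cpu[k] is the lowest CPU number
 * on node k; the node's cores are ASSUMED to be numbered consecutively. */
int cpu_for_thread(int tid, const int node_first_cpu[])
{
    return node_first_cpu[node_of_thread(tid)] + tid % CORES_PER_NODE;
}
```

With this layout, a run on N nodes simply uses threads 0 through 4*N-1, so threads 0-3 land on the first node, threads 4-7 on the second, and so on.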
A quick summary (rates in MB/s, as reported by STREAM) is:

Cores         Copy        Scale          Add        Triad
    4    16168.522    15611.841    14792.440    15054.594
    8    32310.867    31500.593    29829.034    30487.400
   12    47246.455    46633.333    44461.114    45831.040
   16    62178.138    61027.476    58735.613    60465.245
   20    76573.327    77650.951    73775.055    76232.641
   24    87443.956    91611.507    85786.794    90519.459
   28   100257.878   103937.992    99807.447   103711.106
   32   109496.219   121571.276   114468.156   120860.616
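
One way to read the table is to compare each multi-node rate against perfect linear scaling from the single-node (4-core) rate. The sketch below does that arithmetic; the function name scaling_efficiency() is made up for illustration.

```c
/* Hypothetical helper: ratio of the measured bandwidth on `nodes` NUMA
 * nodes to `nodes` times the single-node bandwidth (1.0 = linear scaling). */
double scaling_efficiency(double bw_one_node, double bw_n_nodes, int nodes)
{
    return bw_n_nodes / (bw_one_node * (double)nodes);
}
```

Applied to the 4-core and 32-core rows, this gives about 1.00 for Triad (120860.616 vs. 8 x 15054.594) and about 0.85 for Copy (109496.219 vs. 8 x 16168.522).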
--
John D. McCalpin, Ph.D.
Texas Advanced Computing Center
University of Texas at Austin
https://www.tacc.utexas.edu/about/directory/john-mccalpin
- application/octet-stream attachment: log.icc.4
- application/octet-stream attachment: log.icc.8
Received on Wed May 27 2015 - 17:35:37 CDT