Single-thread results on a Xeon E3-1270 -- a single-socket Sandy Bridge with the client uncore and 2 channels of DDR3/1333 DRAM.
The Xeon E3-1270 processor has 4 cores with a base frequency of 3.4 GHz and a single-core maximum Turbo frequency of 3.8 GHz.
The system was running with HyperThreading enabled, and the process was bound to logical core 1 using "taskset -c 1".
Other system info (from "lshw"):
    product: X9SCL/X9SCM
    vendor: Supermicro
    4 identical 4 GiB DIMMs:
        description: DIMM Synchronous 1333 MHz (0.8 ns)
        product: 9965413-028.A00LF
        vendor: Kingston
        physical id: 0
        serial: 87289B77
        slot: DIMM_1A
        size: 4GiB
        width: 64 bits
        clock: 1333MHz (0.8ns)
These DIMMs appear to be the same as Kingston KVR1333D3E9S/4G, which is a 2-rank, x8 part composed of 18 2Gbit DRAMs.
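As a quick sanity check on that part description, the arithmetic can be sketched as below. The chip count, density, width, and rank count come from the text; reading the extra 8 bits per rank as ECC is my assumption, based on the 72-bit rank width that falls out of the numbers.

```python
# DIMM geometry from the text: 18 DRAM chips, 2 Gbit each, x8 wide, 2 ranks.
CHIPS = 18
GBIT_PER_CHIP = 2
BITS_PER_CHIP = 8      # "x8" part
RANKS = 2
DATA_BITS = 64         # matches "width: 64 bits" in the lshw output

chips_per_rank = CHIPS // RANKS                    # chips sharing one rank
rank_width_bits = chips_per_rank * BITS_PER_CHIP   # total bus width per rank
raw_gbit = CHIPS * GBIT_PER_CHIP                   # raw capacity of all chips

# Assumption: only DATA_BITS of each rank's width hold data; the rest is ECC.
data_gib = raw_gbit * DATA_BITS / rank_width_bits / 8

print(f"{chips_per_rank} chips/rank, {rank_width_bits}-bit rank, "
      f"{raw_gbit} Gbit raw, {data_gib:.1f} GiB of data")
```

Under that assumption the data capacity comes out to the 4 GiB that lshw reports.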
Two dual-rank DIMMs per channel should give very close to the best possible performance. A single thread might do slightly better with one dual-rank DIMM per channel (i.e., fewer rank-to-rank stalls), but the difference should be negligible.
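For context on what "best possible" means here, the nominal peak DRAM bandwidth of this configuration can be estimated from the channel count and transfer rate given above. This is a rough sketch: it assumes the nominal 1333 MT/s rate and an 8-byte (64-bit) data path per channel, and it ignores refresh and other overheads.

```python
# Nominal peak DRAM bandwidth for 2 channels of DDR3/1333.
CHANNELS = 2
TRANSFERS_PER_SEC = 1333e6   # assumption: nominal 1333 MT/s per channel
BYTES_PER_TRANSFER = 8       # 64-bit data bus per channel

# Result in MB/s, with 1 MB = 1e6 bytes
peak_mb_s = CHANNELS * TRANSFERS_PER_SEC * BYTES_PER_TRANSFER / 1e6

print(f"nominal peak: {peak_mb_s:.0f} MB/s")
```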
Both versions were compiled with the Intel 12.1 C compiler using "-xAVX -ffreestanding -O". The version without streaming stores was produced by hand-editing the generated assembly code, to ensure that the code was absolutely positively the same except for the store instructions. Essentially identical results would have been obtained by adding the "-opt-streaming-stores never" flag to the compile line. The default array size was sufficient for this system, since it has only 8 MiB of L3 cache.
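The array-size remark can be checked with a few lines of arithmetic. The sketch below assumes the stream.c version 5.10 default of 10,000,000 elements per array and the guideline from the STREAM source comments that each array should be at least 4 times the size of the last-level cache; the post does not say which source version was used, so treat the default as an assumption.

```python
# Is the default STREAM array size large enough for an 8 MiB L3?
STREAM_ARRAY_SIZE = 10_000_000   # assumption: default in stream.c version 5.10
BYTES_PER_ELEMENT = 8            # STREAM arrays are double precision by default
L3_BYTES = 8 * 2**20             # 8 MiB of L3 cache, from the text

array_bytes = STREAM_ARRAY_SIZE * BYTES_PER_ELEMENT   # size of one array
min_bytes = 4 * L3_BYTES                              # 4x last-level cache
large_enough = array_bytes >= min_bytes

print(f"array: {array_bytes / 2**20:.1f} MiB, "
      f"required: {min_bytes / 2**20:.1f} MiB, ok: {large_enough}")
```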
Summary (MB/s):
Kernel    w/streaming stores    w/o streaming stores
Copy:          17807.8                12316.9
Scale:         17807.8                12102.6
Add:           18162.1                12976.3
Triad:         18134.3                13136.3
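One way to read the gap between the two columns, offered as an interpretation rather than something stated above: STREAM counts only the words each kernel explicitly reads and writes, but without streaming stores each stored cache line is normally read from memory first (a write allocate). Assuming exactly one extra read per stored word, the DRAM traffic implied by the second column can be sketched as follows. The dictionary and function names are illustrative only.

```python
# Reported rates (MB/s) from the summary: (with streaming stores, without).
reported = {
    "Copy":  (17807.8, 12316.9),
    "Scale": (17807.8, 12102.6),
    "Add":   (18162.1, 12976.3),
    "Triad": (18134.3, 13136.3),
}

# Words per iteration that STREAM counts: loads plus one store.
counted_words = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

def speedup(kernel):
    """Ratio of the streaming-store rate to the ordinary-store rate."""
    with_nt, without_nt = reported[kernel]
    return with_nt / without_nt

def implied_traffic(kernel):
    """Estimated DRAM traffic (MB/s) without streaming stores.

    Assumption: one extra read (write allocate) per stored word, so the
    actual traffic is (counted + 1) / counted times the reported rate.
    """
    _, without_nt = reported[kernel]
    words = counted_words[kernel]
    return without_nt * (words + 1) / words

for kernel in reported:
    print(f"{kernel:6s} speedup {speedup(kernel):.2f}x, "
          f"implied traffic {implied_traffic(kernel):.1f} MB/s")
```

Under that assumption the implied traffic in the no-streaming-store case lands in the same range as the streaming-store rates, which would suggest that both versions are limited by roughly the same DRAM bandwidth.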
--
John D. McCalpin, Ph.D.
Texas Advanced Computing Center
University of Texas at Austin
https://www.tacc.utexas.edu/about/directory/john-mccalpin
- application/octet-stream attachment: log.movnt
Received on Wed Jun 03 2015 - 07:45:49 CDT