"Experimental" DP Stream for P4 1.4 GHz, using
streaming stores, average of 10 runs:
copy_ 2102.07 MB/s
scale 2101.54 MB/s
add__ 2111.31 MB/s
triad 2103.53 MB/s
All under Windows 98 SE, with basic system services and
the Intel system monitor running in the background.
--------------------------
System: 1.4 GHz Pentium4 with 1 GB PC800 RDRAM,
in Intel i850 board (homemade system).
All 64 RDRAM devices are in use, so this is not an ideal
system with respect to latency.
Hard disk is 5400 rpm 30 GB Maxtor, ATA/66
Executable: Compiled with Intel C/C++ 5.0, under MS
VC++ 6.0:
/ML /W2 /GX /D "WIN32" /D "NDEBUG" /D
"_CONSOLE" /FA /Fa"Release/"
/Fp".\Release\stream.pch" /YX /Fo".\Release/"
/Fd".\Release/" /FD /G7 /O3 -Qrestrict -QxW
/c
Only /G7 /O3 -QxW are relevant; the code has no
restrict qualifiers, so -Qrestrict has no effect.
Ran 10 times in sequence (from batch file), and
averaged all 10 for results reported above.
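
For readers who want to reproduce the averaging, here is a minimal
sketch in C. It assumes the usual STREAM counting rules (copy and scale
move 2 doubles per iteration, add and triad move 3, and 1 MB = 10^6
bytes). The function names stream_mbps and mean_rate are hypothetical
and are not taken from the attached source.

```c
#include <assert.h>
#include <stddef.h>

/* Rate in MB/s (1 MB = 10^6 bytes) for one timed kernel pass.
   words_per_iter: 2 for copy and scale, 3 for add and triad. */
double stream_mbps(int words_per_iter, size_t n, double seconds)
{
    assert(seconds > 0.0);
    double bytes = (double)words_per_iter * (double)sizeof(double) * (double)n;
    return 1.0e-6 * bytes / seconds;
}

/* Arithmetic mean of the per-run rates (e.g. the 10 batch runs). */
double mean_rate(const double *rates, size_t runs)
{
    double sum = 0.0;
    for (size_t i = 0; i < runs; i++)
        sum += rates[i];
    return runs ? sum / (double)runs : 0.0;
}
```

Note that the standard STREAM code reports the rate from the best
(minimum) time within a run; the averaging here is across separate runs.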
Attached:
(1) Modified source. I added calls to a simple validation
function, similar to that provided with FORTRAN
source, and I used a second() function provided by a
previous poster (my apologies, I've forgotten his
name). The automatic vectorizer would handle every
part of the translation except the streaming stores.
Current or future versions of VTune may add
streaming stores automatically, which would remove
the need to call the results "experimental". I used
the Intel intrinsic functions, to avoid fiddling with
assembly language; for this version, no attempt was
made to use prefetch instructions.
(2) Executable, for P4.
(3) GIF file showing variation with run number.
(4) Batch file to run 10 times, and output of 10 runs
(C_stream_SIMD.txt).
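
For readers without the attachment, the following is a rough sketch of
what a triad kernel with streaming stores can look like using the SSE2
intrinsics. It is not the attached source: the function name
triad_stream, the scalar peel loop for alignment, and the use of
unaligned loads are all assumptions made for illustration. Like the
version described above, it uses no prefetch instructions.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_stream_pd, ... */
#include <stddef.h>
#include <stdint.h>

/* Triad: a[i] = b[i] + scalar * c[i], written with non-temporal
   (streaming) stores so the destination bypasses the cache. */
void triad_stream(double *a, const double *b, const double *c,
                  double scalar, size_t n)
{
    size_t i = 0;
    assert(a != NULL && b != NULL && c != NULL);

    /* _mm_stream_pd requires a 16-byte aligned destination, so handle
       leading elements with ordinary stores until a + i is aligned. */
    while (i < n && ((uintptr_t)(a + i) & 15u) != 0) {
        a[i] = b[i] + scalar * c[i];
        i++;
    }

    __m128d s = _mm_set1_pd(scalar);
    for (; i + 2 <= n; i += 2) {
        __m128d vb = _mm_loadu_pd(b + i);   /* sources may be unaligned */
        __m128d vc = _mm_loadu_pd(c + i);
        _mm_stream_pd(a + i, _mm_add_pd(vb, _mm_mul_pd(s, vc)));
    }

    /* Remaining element, if any. */
    for (; i < n; i++)
        a[i] = b[i] + scalar * c[i];

    /* Make the streaming stores globally visible before returning. */
    _mm_sfence();
}
```

If the arrays are known to be 16-byte aligned, the peel loop never runs
and the loads could be switched to _mm_load_pd.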
This archive was generated by hypermail 2b29 : Mon Apr 23 2001 - 09:29:53 CDT