From: Schmidt, David (Performance Eng.) (d.schmidt@hp.com)
Date: Mon Feb 14 2005 - 21:08:26 UTC
John,
Below are STREAM results for the HP ProLiant DL385 (2 CPU), DL585 (4
CPU), BL25p (2 CPU), BL35p (2 CPU), and ML370 G4 (1 CPU). The
configurations are described below with the results:
HP ProLiant DL385
2x2.6GHz/1MB L2 Opteron 252 processors
16GB PC3200 DDR memory (8x2GB DIMMs)
SuSE Linux Enterprise Server 9 (x86_64)
I used Revision 5.3 of the stream code and compiled with PGI C/C++ for
Linux v.5.2-4:
pgcc -O2 -Mvect=sse -Mnontemporal -Munsafe_par_align -mp -o ompstream stream_omp.c
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 1000000, Offset = 49152
Total memory required = 22.9 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 2
Number of Threads requested = 2
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 4107 microseconds.
(= 4107 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function     Rate (MB/s)   Avg time   Min time   Max time
Copy:          7928.7410     0.0018     0.0020     0.0020
Scale:         7827.9323     0.0018     0.0020     0.0020
Add:           8244.3322     0.0026     0.0029     0.0030
Triad:         8247.0339     0.0026     0.0029     0.0029
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
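The "Total memory required" figure in each run can be reproduced from the array size alone. The sketch below assumes the benchmark's three arrays of 8-byte words and a unit of 2**20 bytes, and it ignores the Offset padding; the function name is illustrative and not taken from the STREAM source.

```python
# Sketch: reproduce the "Total memory required" line of the STREAM output.
# Assumptions: three arrays of N 8-byte words, reported in units of
# 2**20 bytes, with the Offset padding not counted.

BYTES_PER_WORD = 8   # "This system uses 8 bytes per DOUBLE PRECISION word."
NUM_ARRAYS = 3       # the three work arrays used by the four kernels


def stream_memory_mb(array_size: int) -> float:
    """Return the combined array footprint in units of 2**20 bytes."""
    return NUM_ARRAYS * BYTES_PER_WORD * array_size / 1048576.0


if __name__ == "__main__":
    # Array sizes taken from the five runs in this message.
    for n in (1_000_000, 2_500_000, 6_500_000, 10_000_000):
        print(f"Array size = {n}: {stream_memory_mb(n):.1f} MB")
```

Under these assumptions the four array sizes give 22.9, 57.2, 148.8, and 228.9 MB, matching the figures printed by each run.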
========================================================================
HP ProLiant DL585
4x2.6GHz/1MB L2 Opteron 852 processors
32GB PC2700 memory (16x2GB DIMMs)
SuSE Linux Enterprise Server 9 (x86_64)
I used Revision 5.3 of the stream code and compiled with PGI C/C++ for
Linux v.5.2-4:
pgcc -O2 -Mvect=sse -Mnontemporal -Munsafe_par_align -mp -o ompstream stream_omp.c
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2500000, Offset = 49152
Total memory required = 57.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 4
Number of Threads requested = 4
Number of Threads requested = 4
Number of Threads requested = 4
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 14740 microseconds.
(= 14740 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function     Rate (MB/s)   Avg time   Min time   Max time
Copy:         13893.0242     0.0026     0.0029     0.0029
Scale:        13894.1747     0.0026     0.0029     0.0029
Add:          14599.0393     0.0037     0.0041     0.0042
Triad:        14562.7128     0.0037     0.0041     0.0041
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
========================================================================
HP ProLiant BL25p
2x2.6GHz/1MB L2 Opteron 252 processors
16GB memory (8x2GB DIMMs)
SuSE Linux Enterprise Server 9 (x86_64) - kernel 2.6.5-7.97-smp
I used Revision 5.3 of the stream code and compiled with PGI C/C++ for
Linux v.5.2-4:
/usr/pgi/linux86-64/5.2/bin/pgcc -O2 -Mvect=sse -Mnontemporal -Munsafe_par_align -mp -o ompstream stream_omp.c
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2500000, Offset = 512
Total memory required = 57.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 2
Number of Threads requested = 2
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 8744 microseconds.
(= 8744 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function     Rate (MB/s)   Avg time   Min time   Max time
Copy:          8435.0005     0.0043     0.0047     0.0048
Scale:         8407.1036     0.0043     0.0048     0.0048
Add:           8689.2563     0.0062     0.0069     0.0070
Triad:         8670.6946     0.0062     0.0069     0.0070
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
========================================================================
HP ProLiant BL35p
2x2.4GHz/1MB L2 Opteron 250 processors
8GB PC3200 memory (4x2GB DIMMs)
SuSE Linux Enterprise Server 9 (x86_64)
I used Revision 5.3 of the stream code and compiled with PGI C/C++ for
Linux v.5.2-4:
pgcc -O2 -Mvect=sse -Mnontemporal -mp -o ompstream stream_omp.c
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 6500000, Offset = 16384
Total memory required = 148.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 2
Number of Threads requested = 2
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 83088 microseconds.
(= 83088 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function     Rate (MB/s)   Avg time   Min time   Max time
Copy:          8742.5116     0.0107     0.0119     0.0120
Scale:         8737.4332     0.0107     0.0119     0.0119
Add:           9092.4575     0.0155     0.0172     0.0172
Triad:         9062.3596     0.0155     0.0172     0.0172
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
========================================================================
HP ProLiant ML370 G4
1x3.6GHz/2MB L2 Xeon processor
4GB memory (8x512MB DIMMs)
Windows Server 2003
Intel(R) C++ Compiler for 32-bit applications, Version 8.1 Build 20040802Z
icl -Qopenmp -QxW -Qparallel -O3 -w stream_d_omp.c win_second_wall.c -o omp2
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 10000000, Offset = 2048
Total memory required = 228.9 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 43215 microseconds.
(= 43215 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function     Rate (MB/s)   RMS time   Min time   Max time
Copy:          3965.6825     0.0419     0.0403     0.0535
Scale:         3925.5339     0.0409     0.0408     0.0413
Add:           3515.2573     0.0684     0.0683     0.0685
Triad:         3577.7362     0.0674     0.0671     0.0694
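
The Rate column can be cross-checked against the Min time column. The sketch below assumes the usual STREAM accounting: Copy and Scale touch two arrays per element, Add and Triad touch three, each word is 8 bytes, and the rate is bytes moved divided by the best time, in units of 10**6 bytes per second. The names are illustrative. Because the printed times are rounded to four decimals, the recomputed rates agree with the reported ones only approximately.

```python
# Sketch: recompute STREAM rates from array size and best (minimum) time.
# Assumptions: Copy/Scale move 2 arrays' worth of data per pass, Add/Triad
# move 3; words are 8 bytes; 1 MB/s here means 10**6 bytes per second.

BYTES_PER_WORD = 8
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def stream_rate_mb_s(kernel: str, array_size: int, min_time_s: float) -> float:
    """Return the bandwidth in MB/s implied by one kernel's best time."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_WORD * array_size
    return 1.0e-6 * bytes_moved / min_time_s


if __name__ == "__main__":
    # ML370 G4 run from this message: array size 10000000, Min time column.
    n = 10_000_000
    min_times = {"Copy": 0.0403, "Scale": 0.0408, "Add": 0.0683, "Triad": 0.0671}
    for kernel, t in min_times.items():
        print(f"{kernel + ':':<8}{stream_rate_mb_s(kernel, n, t):12.1f} MB/s")
```

For the ML370 G4 run this lands within about one percent of the reported rates, which is what the rounding of the printed times would allow.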
Thanks,
David Schmidt
Hewlett-Packard Company
(281) 514-5039
D.Schmidt@hp.com