John,
The optimized results for the STREAM benchmark on a CM5 do look funny.
I changed the constant 3.0D0 in the SAXPY operation to DBLE(k),
where k is the loop index, and this inhibited whatever optimization
was going on, producing the results below. I also noticed that the
unoptimized code could handle much larger problems (200,000 elements per VU),
while the optimized code ran out of memory well before then (an extra temporary).
I will get and forward an assembly listing of just the SAXPY loop, if you are
interested. Have you got any other info on how STREAM runs on a CM5?
For large enough vectors I would expect it to scale well to larger machines.
Rob
=============================================================================
NO Optimization: cmf stream_d.fcm
SAXPY with constant from loop index
s = DBLE(k)
CALL CM_timer_clear(4)
CALL CM_timer_start(4)
c = a + s * b
CALL CM_timer_stop(4)
t = CM_timer_read_elapsed(4)
times(4,k) = t
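For anyone without a CM5 to hand, the timed kernel above can be sketched in plain Python. This is only an illustration under stated assumptions: ordinary lists stand in for the distributed arrays, time.perf_counter stands in for the CM_timer_* calls, and the name timed_saxpy is hypothetical, not part of stream_d.fcm.

```python
import time

def timed_saxpy(a, b, k):
    """Compute c = a + s*b with s taken from the loop index k, and time it."""
    s = float(k)                   # mirrors s = DBLE(k): scalar is not a compile-time constant
    start = time.perf_counter()    # stands in for CM_timer_clear / CM_timer_start
    c = [x + s * y for x, y in zip(a, b)]
    elapsed = time.perf_counter() - start   # stands in for CM_timer_read_elapsed
    return c, elapsed

n = 1000
a = [1.0] * n
b = [2.0] * n
times = {}
for k in range(1, 4):              # k plays the role of the repetition index
    c, times[k] = timed_saxpy(a, b, k)
```

Because s changes on every repetition, the scalar cannot be folded into the loop as a literal, which is the point of replacing 3.0D0 with DBLE(k).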
STREAM: Measure memory transfer rates in MB/s
for simple computational kernels in Fortran
CM5 with partition of 16 processors ( 64 vector units )
--------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
Array length = 9600000
Elements per VU = 150000
Timing calibration: Time = 5.455848484848485 hundredths of a second
Increase the size of the arrays if this is < 30
and your clock precision is =< 1/100 second
---------------------------------------------------------
Function  :  Rate (MB/s)   RMS time   Min time   Max time
Assignment:  4887.33331    0.03143    0.03143    0.03145
Scaling   :  4893.41038    0.03153    0.03139    0.03276
Summing   :  5145.05312    0.04500    0.04478    0.04615
SAXPYing  :  5143.42075    0.04501    0.04480    0.04616
=======================================================================
Optimization ON: cmf -O stream_d.fcm
STREAM: Measure memory transfer rates in MB/s
for simple computational kernels in Fortran
CM5 with partition of 16 processors ( 64 vector units )
--------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
Array length = 9600000
Elements per VU = 150000
Timing calibration: Time = 5.468930303030303 hundredths of a second
Increase the size of the arrays if this is < 30
and your clock precision is =< 1/100 second
---------------------------------------------------------
Function  :  Rate (MB/s)   RMS time   Min time   Max time
Assignment:  4894.00569    0.03139    0.03139    0.03142
Scaling   :  4898.13904    0.03144    0.03136    0.03273
Summing   :  7328.59745    0.03158    0.03144    0.03280
SAXPYing  :  5143.05891    0.04494    0.04480    0.04615
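
As a rough cross-check, the reported rates can be recomputed from the array length and the minimum times. The sketch below assumes the usual STREAM counting (two words moved per element for Assignment and Scaling, three for Summing and SAXPYing) and 1 MB = 10^6 bytes; the printed times are rounded, so agreement is only approximate.

```python
BYTES_PER_WORD = 8      # "Assuming 8 bytes per DOUBLEPRECISION word"
N = 9_600_000           # "Array length = 9600000"

# Words read plus written per element (assumed standard STREAM counting)
WORDS = {"Assignment": 2, "Scaling": 2, "Summing": 3, "SAXPYing": 3}

def rate_mb_s(kernel, min_time):
    """Memory transfer rate in MB/s for one kernel, from its minimum time in seconds."""
    return WORDS[kernel] * BYTES_PER_WORD * N / min_time / 1.0e6

# (kernel, min time, reported rate) copied from the optimized run
optimized = [
    ("Assignment", 0.03139, 4894.00569),
    ("Scaling",    0.03136, 4898.13904),
    ("Summing",    0.03144, 7328.59745),
    ("SAXPYing",   0.04480, 5143.05891),
]

for kernel, t, reported in optimized:
    print(f"{kernel:10s} computed {rate_mb_s(kernel, t):9.2f}   reported {reported:9.2f}")
```

Under these assumptions the optimized Summing kernel finishes in about the same time as Assignment (0.03144 s against 0.03139 s) while being credited with three words per element instead of two, which is where the 7328 MB/s figure comes from. The unoptimized Summing run took 0.04478 s.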