Date: Tue, 5 Nov 91 13:51:44 EST
From: "John D. McCalpin" <mccalpin@perelandra.cms.udel.edu>
I was looking over the memory bandwidth results that you provided
for the CM-2, and had a question I hope you could answer.
The Copy and Scale operations ran at about 56 GB/s on the 64-K machine.
This implies a memory bandwidth limitation of
4 Bytes/clock/fpu * 2048 fpus * 7 MHz = 57.3 GB/s
The Sum and Triad operations ran at about 80 GB/s, which implies
that there is more data bandwidth available. Do the fpus have more
than one independent 32-bit data path?
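
The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the helper name `peak_bandwidth_gb_s` is made up for illustration, and 1 GB is taken to be 10^9 bytes, which is the convention that yields the 57.3 figure.

```python
def peak_bandwidth_gb_s(bytes_per_clock_per_fpu, n_fpus, clock_hz):
    """Aggregate bandwidth in GB/s, taking 1 GB = 1e9 bytes (assumption)."""
    return bytes_per_clock_per_fpu * n_fpus * clock_hz / 1e9

# Figures quoted in the message: 4 bytes/clock/FPU, 2048 FPUs, 7 MHz clock.
single_path = peak_bandwidth_gb_s(4, 2048, 7.0e6)

# Sum and Triad were reported at about 80 GB/s; compare with the estimate.
ratio = 80.0 / single_path

print(f"one 32-bit path per FPU: {single_path:.1f} GB/s")
print(f"Sum/Triad rate relative to that estimate: {ratio:.2f}x")
```

A ratio well above 1 is what prompts the question about a second data path.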
Thanks for any explanation!
--
John D. McCalpin mccalpin@perelandra.cms.udel.edu
Assistant Professor mccalpin@brahms.udel.edu
College of Marine Studies, U. Del. DELOCN::MCCALPIN (SPAN)
No. It is just that the Fortran compiler is very clever and generates the
best possible sequence of code for these cases. Looking at the timings,
even though the machine is doing an extra load for Sum and Triad, the timing
is not much different from Copy and Scale; the difference is about 5%. I see
a very similar thing on the Cray.
CM2 64K
--------
gorka(test)% stream
-------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLEPRECISION word
-------------------------------------
Calibrating CM timer...Done. CM speed = 7.00 MHz
Timing calibration ; time = 143.324911594391 hundredths of a second
Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second
---------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment    56555.9654     .0231      .0231      .0231
Scaling       56555.9654     .0231      .0231      .0231
Summing       81244.0132     .0242      .0242      .0242
SAXPYing      80593.2540     .0244      .0244      .0244
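
The "about 5%" figure can be checked against the table above. The sketch below assumes the usual STREAM accounting, in which Assignment and Scaling move 2 words per element (1 load, 1 store) and Summing and SAXPYing move 3 (2 loads, 1 store); the variable names are illustrative only. Because the printed times are rounded to four decimals, the predicted rate ratio will only approximately match the reported one.

```python
WORD_BYTES = 8  # the run reports 8 bytes per DOUBLEPRECISION word

# (rate in MB/s, min time) copied from the table above
results = {
    "Assignment": (56555.9654, 0.0231),
    "Scaling": (56555.9654, 0.0231),
    "Summing": (81244.0132, 0.0242),
    "SAXPYing": (80593.2540, 0.0244),
}

# Words moved per element (loads + stores), assuming standard STREAM counting
words_per_element = {"Assignment": 2, "Scaling": 2, "Summing": 3, "SAXPYing": 3}

base_rate, base_time = results["Assignment"]
summary = {}
for name, (rate, t) in results.items():
    extra_time = (t - base_time) / base_time          # slowdown vs. Assignment
    reported = rate / base_rate                       # ratio of printed rates
    predicted = (words_per_element[name] / 2) * (base_time / t)
    summary[name] = (extra_time, reported, predicted)
    print(f"{name:<10} bytes/elem {words_per_element[name] * WORD_BYTES:2d}  "
          f"extra time {extra_time:5.1%}  "
          f"rate ratio reported {reported:.3f}, predicted {predicted:.3f}")
```

The point of the comparison is that moving 50% more data costs only about 5% more time, so the reported rate rises by roughly 40%.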
Here are the two DO loops and the vector code that our compiler generates:
First DO loop:
      DO 40 j = 1,n
         c(j) = a(j) + b(j)
   40 CONTINUE
Code generated by the compiler; every instruction here is a vector instruction.
procedure _stream_pe_code_3
L1$_stream_pe_code_3:
        popq aC2
        popa SP
        # Get address of A
        popa aP2
        # Get address of B
        popa aP3
        # Get address of C
        popa aP4
L2$_stream_pe_code_3:
        dflodv [aP3+0]2++ aV0
        # "stream_d.fcm" line 102
        # C = A + B
        dfaddv [aP2+0]2++ aV0 aV1
        dfstrv aV1 [aP4+0]2++
        jnz aC2 L2$_stream_pe_code_3
end
Second DO loop:
      DO 50 j = 1,n
         c(j) = a(j) + 3.0D0*b(j)
   50 CONTINUE
Code generated by the compiler; every instruction here is a vector instruction.
procedure _stream_pe_code_4
L1$_stream_pe_code_4:
        popq aC2
        popa SP
        # Get address of A
        popa aP2
        # Get address of B
        popa aP3
        # Get address of C
        popa aP4
        # "stream_d.fcm" line 110
        # C = A + 3.0D0*B
        dflodc $3.000000000000000000d+00 aS28
L2$_stream_pe_code_4:
        dflodv [aP2+0]2++ aV0
        # "stream_d.fcm" line 110
        # C = A + 3.0D0*B
        dfmuladdv aS28 [aP3+0]2++ aV1 aV0 aV1
        dfstrv aV1 [aP4+0]2++
        jnz aC2 L2$_stream_pe_code_4
end
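
One way to compare the two listings is to count, for each inner loop, how many instructions carry a memory operand. The sketch below assumes that a bracketed operand such as `[aP3+0]` denotes one vector memory access, which is consistent with the "Get address of" comments but is not confirmed by the listings themselves; `memory_refs` is a hypothetical helper name.

```python
import re

# Inner-loop instructions copied from the two listings (comment lines omitted).
sum_loop = """
dflodv [aP3+0]2++ aV0
dfaddv [aP2+0]2++ aV0 aV1
dfstrv aV1 [aP4+0]2++
jnz aC2 L2$_stream_pe_code_3
"""

# The constant 3.0D0 is loaded by dflodc before the loop, so it is not here.
triad_loop = """
dflodv [aP2+0]2++ aV0
dfmuladdv aS28 [aP3+0]2++ aV1 aV0 aV1
dfstrv aV1 [aP4+0]2++
jnz aC2 L2$_stream_pe_code_4
"""

MEM_OPERAND = re.compile(r"\[aP\d+\+\d+\]")

def instructions(listing):
    """Return the non-empty lines of a listing."""
    return [line for line in listing.strip().splitlines() if line.strip()]

def memory_refs(listing):
    """Count instructions with a bracketed operand (assumed: one memory access)."""
    return sum(1 for line in instructions(listing) if MEM_OPERAND.search(line))

for name, loop in (("Sum", sum_loop), ("Triad", triad_loop)):
    print(f"{name}: {len(instructions(loop))} instructions, "
          f"{memory_refs(loop)} with a memory operand")
```

Under that assumption both loops issue the same number of memory-referencing instructions per iteration, with the second load folded into the arithmetic instruction rather than issued as a separate load.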