John,
What I meant, but did not say, is that the results can be a factor of two
worse with different alignments. This happens on an X-MP, but not on the Y-MP,
where the memory is better.
It would be nice to add a column of bytes/cycle to your table of results.
The C-90 theoretical peak is 3*2*8*16 = 768 bytes/cycle, or 1024 bytes/cycle
if you include the i/o ports as well.
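
For anyone wanting to reproduce the arithmetic, here is a small sketch. The reading of the factors (memory ports per CPU, words per port per cycle, bytes per word, number of CPUs) is an assumption, as is treating the i/o port as a fourth port of the same width; only the products 768 and 1024 are given above. The function name is made up for illustration.

```python
# Sketch of the peak-bandwidth arithmetic. The meaning attached to each
# factor is an assumption; only the products come from the text above.

def peak_bytes_per_cycle(ports, words_per_port, bytes_per_word, cpus):
    """Aggregate peak memory traffic in bytes per clock cycle."""
    return ports * words_per_port * bytes_per_word * cpus

# C-90: 3 * 2 * 8 * 16
c90_peak = peak_bytes_per_cycle(3, 2, 8, 16)

# Assumed: the i/o port counts as a fourth port of the same width.
c90_peak_with_io = peak_bytes_per_cycle(4, 2, 8, 16)

print(c90_peak, c90_peak_with_io)
```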
The results from the Convexes will be interesting. Since there is only
one port to memory for each CPU, the rate is 8 bytes/cycle per CPU, or
32 bytes/cycle for C2 and 64 bytes/cycle for C3. Thus the SAXPY operation
can proceed only at one third of the speed the CPU is capable of, since the
data transfer is the bottleneck. The Linpack 1000 results are obtained by
rewriting the code with much unrolling to have only one vector reference
for each two arithmetic operations.
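
The one-third figure can be checked with a rough model. This is only a sketch under stated assumptions: SAXPY (y = a*x + y) needs three vector references per element (load x, load y, store y) for two arithmetic operations, the single port moves one 8-byte word per cycle, and the CPU can complete one multiply and one add per cycle. The function and parameter names are hypothetical.

```python
from fractions import Fraction

def fraction_of_cpu_peak(mem_refs, flops, words_per_cycle=1, flops_per_cycle=2):
    """Fraction of peak arithmetic speed reachable when memory may be the limit.

    Assumed model: per element, the loop takes the longer of the time to
    move mem_refs words and the time to do flops arithmetic operations.
    """
    mem_cycles = Fraction(mem_refs, words_per_cycle)
    cpu_cycles = Fraction(flops, flops_per_cycle)
    return min(Fraction(1), cpu_cycles / mem_cycles)

# SAXPY: 3 vector references per 2 arithmetic operations.
saxpy = fraction_of_cpu_peak(mem_refs=3, flops=2)

# Unrolled code: 1 vector reference per 2 arithmetic operations.
unrolled = fraction_of_cpu_peak(mem_refs=1, flops=2)

print(saxpy, unrolled)
```

Under the same assumptions, the unrolled form with one vector reference per two arithmetic operations is no longer limited by the port.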
I now have good autotasking results from our Y-MP2/216.
Alignment and stride tests are fun! You can make the Y-MP slow down by
a factor of 15 (12 on later models) by stepping through all the vectors with
a stride of k*(number of banks). On the Convexes (and I suspect the IBM 3090
VF), with large strides you hit cache problems in a big way.
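
To see why a stride of k*(number of banks) is so bad, here is a toy model of interleaved memory. Everything in it is an assumption for illustration: word i*stride lives in bank (i*stride) mod banks, one reference can be issued per cycle, and a bank stays busy for a fixed number of cycles after each reference. The bank count and busy time used below are made-up small numbers, not Y-MP figures.

```python
def cycles_for_stride(n, stride, banks, busy):
    """Cycles needed to issue n references with the given stride.

    Toy model (assumed): one issue per cycle at best; a reference to a
    bank that is still busy waits until that bank becomes free.
    """
    free_at = [0] * banks          # cycle at which each bank is next free
    clock = 0
    for i in range(n):
        bank = (i * stride) % banks
        start = max(clock, free_at[bank])   # stall if the bank is busy
        free_at[bank] = start + busy
        clock = start + 1                   # next issue slot
    return clock

# Illustrative parameters only: 8 banks, bank busy for 4 cycles.
unit_stride = cycles_for_stride(64, stride=1, banks=8, busy=4)
bank_stride = cycles_for_stride(64, stride=8, banks=8, busy=4)

print(unit_stride, bank_stride, bank_stride / unit_stride)
```

In this model every reference with the bad stride lands in the same bank, so the slowdown approaches the bank busy time.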
Cheers,
Rob.
P.S. Another useful figure would be peak memory transfer rate / peak megaflops,
as a measure of the memory performance relative to what most people look at.