||>(c) So for a 16K CM2 he gets 20148.3135 MB/s which *I* compute to be
||> 5.622 bytes/PE/clock which is faster than the theoretical rate which *I*
||> think is 4 bytes/PE/clock.
||> (Those emphatic I's mean : feel free to have a different opinion).
||I have been surprised that no one has come forward with a convincing,
||definitive explanation on this....
I verified it with the local Weitek expert (Tony Kennedy) and yes, one slice
per clock. Only one hardware path between memory and the Weitek.
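
For what it's worth, the bytes/PE/clock arithmetic quoted above can be sketched like this. It rests on assumptions not stated in this thread: that a "PE" here means one Weitek, that a 16K CM2 has 512 of them (one per 32 processors), that the clock is 7 MHz, that a slice is 4 bytes, and that MB means 10^6 bytes. The function name is made up for illustration.

```python
def bytes_per_pe_per_clock(mb_per_s, n_pe, clock_hz):
    """Convert an aggregate rate in MB/s (10**6 bytes) to bytes per PE per clock."""
    return mb_per_s * 1e6 / (n_pe * clock_hz)

N_PE_16K = 16 * 1024 // 32   # assumption: one Weitek per 32 processors -> 512
CLOCK_HZ = 7e6               # assumption: 7 MHz clock
SLICE_BYTES = 4              # assumption: one slice = 4 bytes, one slice per clock

rate = bytes_per_pe_per_clock(20148.3135, N_PE_16K, CLOCK_HZ)
ratio = rate / SLICE_BYTES   # how far above the one-slice-per-clock limit
print(f"{rate:.3f} bytes/PE/clock = {ratio:.2f} x the one-slice-per-clock limit")
```

Under those assumptions the reported figure comes out about 1.4 times above what a single memory path could deliver.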
||I believe that I stated in the last posting that the transfer rates
||were computed by
|| MB/s = 12 * MFLOPS
||for the SAXPY operation.
So you did. Sorry. Anyway, as the Cray folks objected to the xfer rate
directly, it's immaterial now.
(By the way:
           time      mflops       mflops
           (report)  (I compute)  (you compute)
Scale      0.523     239.0        261.9 ??
Sum        0.716     174.6        159.3 ??
SAXPY      0.750     333.3        333.3)
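
A sketch of how the mflops-from-time figures (239.0, 174.6, 333.3) can be reproduced. The assumptions: the times are in seconds, Scale and Sum do one flop per element, and SAXPY does two. The total element count is not taken from the report; it is backed out of the SAXPY row, where the two computations agree, so `ELEMENTS` is an inferred value.

```python
FLOPS_PER_ELEMENT = {"Scale": 1, "Sum": 1, "SAXPY": 2}   # assumption
TIME = {"Scale": 0.523, "Sum": 0.716, "SAXPY": 0.750}    # times from the report

# Inferred, not reported: elements processed, from the SAXPY row
# (333.3 mflops sustained over 0.750, two flops per element).
ELEMENTS = 333.3e6 * TIME["SAXPY"] / FLOPS_PER_ELEMENT["SAXPY"]

def mflops(kernel):
    """Mflops for a kernel, given the same element count for every kernel."""
    return FLOPS_PER_ELEMENT[kernel] * ELEMENTS / TIME[kernel] / 1e6

for kernel in TIME:
    print(f"{kernel:6s} {mflops(kernel):6.1f}")
```

With the work fixed by the SAXPY row, the other two rows depend only on the reported times.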
||Alex's results showed about 80 GB/s for SAXPY, and no one can explain
||how results greater than 55 GB/s can be obtained.
81 GB/s for summing vs. 57.344 GB/s SoL = 1.4 x SoL (not 1.6 as you have been saying).
So much for history. I have now made more runs to establish clearly that
the timer was bad; with the proper timer you get xfer rates of about
2 bytes/PE/clock, which is half the SoL. Why not more, I don't know.
New results, 7 MHz CM2 (this was 8K but it's independent of size) :
(repeat count 1000)
             time      MB/s    bytes/PE/clock  ops/PE/clock  Mflops
Assignment   50.8104   3224.5  1.80
Scaling      50.8113   3224.5  1.80            0.1125        201.5
Summing      67.1950   3657.4  2.04            0.0825        152.4
SAXPYing     70.3214   3494.8  1.95            0.1625        291.2
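
The bytes/PE/clock and Mflops columns follow from MB/s alone. A sketch, under assumptions of mine: an 8K CM2 has 256 PEs (one Weitek per 32 processors), words are 8 bytes, MB means 10^6 bytes, and each kernel moves the usual number of words per element (2 for Assignment and Scaling, 3 for Summing and SAXPYing). The 12 bytes per flop that falls out for SAXPY matches the MB/s = 12 * MFLOPS rule quoted earlier.

```python
N_PE = 8 * 1024 // 32    # assumption: one Weitek per 32 processors -> 256
CLOCK_HZ = 7e6           # 7 MHz CM2
WORD_BYTES = 8           # assumption: 8-byte words

# kernel: (MB/s from the table, words moved per element, flops per element)
KERNELS = {
    "Assignment": (3224.5, 2, 0),
    "Scaling":    (3224.5, 2, 1),
    "Summing":    (3657.4, 3, 1),
    "SAXPYing":   (3494.8, 3, 2),
}

def derive(mb_per_s, words, flops):
    """Return (bytes/PE/clock, Mflops) for one kernel."""
    bytes_pe_clock = mb_per_s * 1e6 / (N_PE * CLOCK_HZ)
    mflops = mb_per_s * flops / (words * WORD_BYTES)
    return bytes_pe_clock, mflops

for name, row in KERNELS.items():
    b, m = derive(*row)
    print(f"{name:10s} {b:5.2f} bytes/PE/clock {m:7.1f} Mflops")
```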
For Sum and SAXPY, this is identical to what I reported before with an
entirely different program. My older number for Scale (corrected for
clock speed) was 209 mflops.
Haven't had a chance at a CM-200 yet.
This archive was generated by hypermail 2b29 : Tue Apr 18 2000 - 05:23:02 CDT