Alex V. sent me an agitated letter asking basically what the hell I was
doing supplying lower numbers about the long-vector benchmarks.
As I remember it, you told me in October that Alex was reporting faster-than-
speed-of-light results, but I never saw his numbers nor did I save the results
which you posted. You say Alex was exceeding SoL by a factor of 1.6 or 160,
depending on whether you believe the code actually executed the repeat loop.
Charlie Grassl then pointed out that something was wrong - I can't remember
seeing that either.
I am now playing with the program you sent me in December, which I assume is
the one Alex was using, and indeed it gives incomprehensible results; I am
still assuming it's the timer he's using that is broken.
Could you provide the results he sent you, and the MFLOP rates you computed
from them? Thanks.
This is what I have so far. This is for SAXPY - others are similar.
Typically, Asst/Scale *times* are a bit less than Sum/SAXPY times.
The first two columns are program output. The next two are my computations.
Now here I was assuming that both codes (with and w/o opt) did execute the
repeat count of 100 - I know the MB/s reported by the program is not counting
that.
The longest vectors are the ones that use all of memory.
Again, this is *Alex's* program using CM_timer_read_cm_busy.
SAXPY
Machine          Size  Vector      MB/s  mintime  bytes/PE/  ops/PE/
                       length                         clock    clock
Unoptimized:
CM-2 (7 MHz)     4K     10000   28.0604   2.1896      3.131   0.2610
CM-2 (7 MHz)     4K     20000   50.3708   2.4395      5.627   0.4689
CM-2 (7 MHz)     8K     10000   31.1904   1.9698      1.741   0.1450
CM-2 (7 MHz)     8K     20000   56.1208   2.1896      3.131   0.2610
CM-2 (7 MHz)     8K     40000  100.7416   2.4395      5.627   0.4689
CM-2 (7 MHz)     16K    20000   62.3807   1.9698      1.741   0.1450
CM-2 (7 MHz)     16K    40000  112.2415   2.1896      3.132   0.2610
CM-2 (7 MHz)     16K    80000  201.4831   2.4395      5.622   0.4685
CM-200 (10 MHz)  8K     40000  108.1830   2.2717      4.226   0.3522
Optimized:
CM-2 (7 MHz)     4K     10000   48.5427   1.2657      5.417   0.4514
CM-2 (7 MHz)     4K     20000   88.3611   1.3907      9.862   0.8218
CM-2 (7 MHz)     8K     20000   97.0858   1.2657      5.417   0.4514
CM-2 (7 MHz)     8K     40000  176.7226   1.3907      9.862   0.8218
CM-2 (7 MHz)     16K    20000  107.7191   1.1407      3.006   0.2505
CM-2 (7 MHz)     16K    40000  194.1716   1.2657      5.418   0.4515
CM-2 (7 MHz)     16K    80000  353.4454   1.3907      9.862   0.8218
CM-200 (10 MHz)  8K     20000  104.6718   1.1740      4.089   0.3408
CM-200 (10 MHz)  8K     40000  189.2043   1.2989      7.391   0.6159
CM-200 (10 MHz)  8K     80000  345.1913   1.4239     13.484   1.1237
CM-200 (10 MHz)  8K    160000  634.6840   1.5489     24.792   2.0660
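
The two computed columns can be reproduced from the reported MB/s. The
sketch below is a reconstruction from the numbers in the table, not a formula
stated anywhere in the message, and it rests on three assumptions: the
reported MB/s is scaled by the repeat count of 100, a "PE" is one unit per 32
processors (so 8K processors means 256 PEs), and one "op" corresponds to 12
bytes (the ratio between the two columns). The function and constant names
are made up for illustration.

```python
REPEAT_COUNT = 100   # assumed: the repeat loop ran but is not counted in MB/s
PROCS_PER_PE = 32    # assumed: one "PE" per 32 processors (inferred from the table)
BYTES_PER_OP = 12    # assumed: ratio of bytes/PE/clock to ops/PE/clock in the table


def per_pe_rates(mb_per_s, procs, clock_mhz):
    """Return (bytes/PE/clock, ops/PE/clock) from the program's reported MB/s."""
    pes = procs / PROCS_PER_PE
    # MB/s and MHz both carry a factor of 1e6, so it cancels.
    bytes_per_pe_clock = mb_per_s * REPEAT_COUNT / (clock_mhz * pes)
    return bytes_per_pe_clock, bytes_per_pe_clock / BYTES_PER_OP


rows = [
    ("CM-2 8K, N=10000, unoptimized", 31.1904, 8 * 1024, 7),
    ("CM-200 8K, N=160000, optimized", 634.6840, 8 * 1024, 10),
]
for label, mbs, procs, mhz in rows:
    b, o = per_pe_rates(mbs, procs, mhz)
    print(f"{label}: {b:.3f} bytes/PE/clock, {o:.4f} ops/PE/clock")
```

Under these assumptions most rows match to the printed precision; a few
(for example the 5.627 entries) differ in the third decimal place.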
There is some method in this madness. At least the timer output is positively
correlated with reality. Also, factoring out machine size gives identical
results. However, comparing the CM-2 with the CM-200 shows that the timer
screws up differently depending on machine type.
On optimization: it appears the compiler does optimize something away.
For CM-2 8K, N=20000:
               reported   estimate from wall clock
unoptimized    2.1896     3.1
optimized      1.2657     0.21
Okay, so I did the obvious thing and increased the repeat count, and the
time (optimized OR unoptimized) didn't change.
THEREFORE, the performance I computed above is actually a factor of 100
smaller. Which is nonsense.
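
The repeat-count check described above can be sketched as a small test: if
the timer is trustworthy and the loop really runs, the reported time should
grow roughly in proportion to the repeat count. This is only an illustration.
`measure` is a hypothetical callable standing in for one run of the benchmark
that returns the time reported for a given repeat count; it is not part of
Alex's program or of the CM timer interface, and the repeat counts and
tolerance are arbitrary choices.

```python
def timer_scales_with_repeats(measure, repeats=(100, 200, 400), rel_tol=0.25):
    """Return True if reported time grows about linearly with the repeat count.

    measure(r) is a hypothetical hook: it runs the kernel r times and returns
    the elapsed time as reported by the timer under test.
    """
    base_r = repeats[0]
    base_t = measure(base_r)
    if base_t <= 0:
        return False  # a zero or negative time is already suspicious
    for r in repeats[1:]:
        expected = base_t * r / base_r
        if abs(measure(r) - expected) > rel_tol * expected:
            return False
    return True


# A timer that tracks the work passes; one that returns the same
# time regardless of repeat count (the behaviour seen here) fails.
print(timer_scales_with_repeats(lambda r: 0.02 * r))
print(timer_scales_with_repeats(lambda r: 2.1896))
```

A failure does not say which side is wrong: either the timer is not measuring
the loop, or the loop is not executing the extra repeats.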
Now *you* were in doubt whether Alex was reporting a rate 1.6 or 160 x SoL.
As I see it, he was reporting either 1.6 or 0.016 x SoL. Which is still
nonsense. But then, I don't know which run he reported.
I will have to write a test program demonstrating the timer.