Hi again,
Here are the STREAM results for a hand-tuned version running on an Apple G4
machine. As before, it's a 400 MHz system; it may also be worth noting
that my test machine has 3-2-2 SDRAM installed.
The major change from my previous results is that I
recoded the inner loop in assembly and added use of 'dcbz'. The prefetch
strategy is similar to the previous, but the exact parameters used may have
been improved. (Empirically these settings seem to work a little better,
though theoretically there shouldn't really be any difference. I wish I
knew exactly how the data streaming was implemented.)
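For readers unfamiliar with 'dcbz': it zero-fills a whole cache block and claims it in the cache without first reading it from memory, so a destination line that is about to be completely overwritten never has to be fetched. The following is only a rough C sketch of that idea applied to a copy loop, not the attached assembly. The names copy_dcbz and line_zero are made up for illustration, the 32-byte block size is an assumption about the G4, the arrays are assumed not to overlap, and the prefetch (data streaming) setup is left out. On non-PowerPC compilers the dcbz step compiles to nothing, so the routine is then just a plain copy.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-byte cache blocks, as on the G4. Other PowerPC parts differ. */
#define CACHE_LINE 32
#define DOUBLES_PER_LINE (CACHE_LINE / sizeof(double))

/* Hypothetical helper: zero-allocate the cache block containing p.
 * Uses dcbz on PowerPC with GCC-style inline asm; a no-op elsewhere. */
static inline void line_zero(void *p)
{
#if defined(__GNUC__) && (defined(__powerpc__) || defined(__ppc__))
    __asm__ volatile ("dcbz 0,%0" : : "r"(p) : "memory");
#else
    (void)p;
#endif
}

/* Copy n doubles from src to dst; dst and src must not overlap. */
void copy_dcbz(double *restrict dst, const double *restrict src, size_t n)
{
    size_t i = 0;

    /* Head: plain stores until dst + i sits on a cache-block boundary. */
    while (i < n && ((uintptr_t)(dst + i) % CACHE_LINE) != 0) {
        dst[i] = src[i];
        i++;
    }

    /* Body: each full destination block is claimed with dcbz and then
     * completely overwritten, so its old contents are never needed. */
    for (; i + DOUBLES_PER_LINE <= n; i += DOUBLES_PER_LINE) {
        line_zero(dst + i);
        for (size_t k = 0; k < DOUBLES_PER_LINE; k++)
            dst[i + k] = src[i + k];
    }

    /* Tail: a partial block must not be zeroed, so store normally. */
    for (; i < n; i++)
        dst[i] = src[i];
}
```

The head and tail loops matter: dcbz clears the entire block containing the address, so it is only safe on blocks that lie wholly inside the destination array.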
The improvement over my previous results is reasonably good for what is,
after all, a memory-bandwidth-limited test. Presumably most if not all of
it is due to using dcbz:
Copy: 19%
Scale: 17%
Add: 9%
Triad: 12%
The results fluctuate more than I'd like from run to run, even when bumping
up the array size or number of runs. I suspect some hidden state (TLBs,
perhaps) is affecting the results.
I don't know why the 'add' test runs more slowly than 'triad'. This is
quite counterintuitive as the inner loop is exactly the same except that
triad is using multiply-adds, which are somewhat slower than additions.
The simulator I'm using claims each should run in the same number of clock
cycles (the extra latency of the multiplication is hidden by memory
stalls), but real life is different.
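As a point of reference, here is a plain-C sketch of the two kernels as STREAM defines them. It is not the attached assembly, and the function names are made up for illustration. Both loops read two arrays and write one, so they move the same 24 bytes per iteration; the only difference is the extra multiply in triad.

```c
#include <stddef.h>

/* STREAM "add": c[i] = a[i] + b[i]  (two loads, one store per element) */
void stream_add(double *restrict c, const double *restrict a,
                const double *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* STREAM "triad": a[i] = b[i] + s * c[i]  (same memory traffic as add,
 * plus one multiply that can be fused into a multiply-add) */
void stream_triad(double *restrict a, const double *restrict b,
                  const double *restrict c, double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}
```

Note that the two tests assign the roles of the three arrays differently (add writes c, triad writes a), so the relative placement of the arrays is not identical between them.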
I've attached my hand-coded loops. (I modified the main source to just
call these procedures.)
Thanks again!
-- Anton
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 400000, Offset = 0
Total memory required = 9.2 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 10 microseconds.
Each test below will take on the order of 21117 microseconds.
(= 2111 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Copy:           569.6991     0.0115     0.0112     0.0118
Scale:          545.0055     0.0119     0.0117     0.0123
Add:            484.1150     0.0200     0.0198     0.0202
Triad:          497.1517     0.0195     0.0193     0.0198
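The rates above can be checked by hand. STREAM counts the bytes each kernel moves (two arrays for Copy and Scale, three for Add and Triad), divides by the best time, and reports the result in units of 10^6 bytes per second. The sketch below assumes that convention; the function name stream_rate_mbs is made up for illustration.

```c
#include <stddef.h>

/* Bandwidth in MB/s (1 MB = 10^6 bytes) for a kernel that touches
 * `arrays` arrays of `n` elements of `word_bytes` bytes in `seconds`. */
double stream_rate_mbs(size_t n, int arrays, size_t word_bytes, double seconds)
{
    double bytes = (double)arrays * (double)word_bytes * (double)n;
    return 1.0e-6 * bytes / seconds;
}
```

With n = 400000 and 8-byte words, Copy moves 6.4 million bytes, so a best time of 0.0112 s gives about 571 MB/s. The small gap to the reported 569.6991 is expected, since the printed times are rounded to four decimal places.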
-- This is an unauthorized communication. "The statements and opinions expressed herein are my own and do not necessarily represent those of Adaptec."