ce107@cfm.brown.edu (C. Evangelinos) writes:
> Which in turn poses the question: On systems where non-cacheable
> load/stores (like VIS) can be used or on others where certain stores
> can avoid write-allocate traffic (eg. the new PIII SIMD extensions
> unless I'm mistaken) are there compilers that will actually generate
> the code (preferably in the absence of non-portable directives) or
> does one always need to go and use assembler or special tools like
> Intel's Vtune? STREAM experimental numbers contain very impressive
> results for Suns using VIS but at least the 4.2 compilers didn't
appear capable of generating such code even for a simple copy loop with
> a hard-coded loop length... No idea if the 5.0 compilers are any
> better.
The SunPro 5.0 compilers can handle limited cases (I think they have
an inline template for copy loops and probably some vector loops). If
you look at the latest SPEC submissions you should find the extra
flags to force these optimizations.
One of my students is working on a code generator for SUIF that uses
VIS instructions and some less well-known special features of
UltraSPARC to produce optimal code for STREAM and slightly more
complicated loops that arise, e.g., in sparse matrix solvers. For
one isolated loop I also have a parametrizable m4 generator that
emits the assembly code. Below are results for plain compiled C++
and for C++ with an assembly inset. The results are in MB/s; each
row corresponds to a different array size, and the columns give the
size of a single array (the loop uses six of them), the minimum
bandwidth obtained over the repetitions, the root mean square (RMS
over all repetitions), two arithmetic averages (one including loop
overhead, one without), and finally the maximum bandwidth obtained
over the repetitions. The machine tested is a U60/2360 (only one
processor used). Even though I've used the high-resolution timer,
the first three rows have to be taken with a grain of salt due to
the timing overhead. The code can't use any VIS, but I've used read
and write prefetch in the assembly version to make the in-cache
results more stable. The fun thing is that while the assembly was
originally intended for out-of-cache sizes, it turns out to do much
better than the compiled code in the in-cache case as well, actually
approaching the theoretical limit, whereas the compiled code barely
reaches half of it.
~/solver (5) vh5
Size of one vector in MB: 13.57
Number of repetitions: 1024
Size Min RMS Avg1 Avg2 Max
16 388.8 894.9 622.7 895.6 896.8
32 628.3 1108. 855.6 1108. 1129.
64 744.3 1332. 1144. 1333. 1342.
128 766.0 1410. 1297. 1414. 1423.
256 830.7 1433. 1366. 1439. 1456.
512 773.9 979.4 961.2 980.5 989.1
1024 744.4 771.5 765.4 771.5 773.4
2048 600.1 730.4 730.0 733.0 762.0
4096 717.7 753.5 752.1 753.7 759.1
8192 677.3 724.1 720.6 724.8 755.3
16384 601.5 656.5 656.7 657.1 678.5
32768 546.7 561.8 562.0 562.2 615.8
65536 511.5 526.5 526.6 526.8 578.6
131072 377.3 386.2 386.9 386.9 460.6
262144 299.1 308.1 308.6 308.6 363.8
524288 294.5 302.8 302.9 302.9 331.2
1048576 298.3 301.8 301.8 301.8 309.5
~/solver (6) vh5.RWA
Size of one vector in MB: 13.57
Number of repetitions: 1024
Size Min RMS Avg1 Avg2 Max
16 158.8 1356. 901.5 1396. 1429.
32 764.8 1881. 1406. 1885. 1893.
64 1291. 2171. 1811. 2173. 2180.
128 1742. 2374. 2141. 2375. 2384.
256 1992. 2403. 2242. 2404. 2415.
512 2174. 2448. 2360. 2448. 2458.
1024 2056. 2315. 2274. 2317. 2337.
2048 2120. 2425. 2402. 2428. 2480.
4096 1936. 2254. 2246. 2258. 2315.
8192 1893. 2233. 2232. 2238. 2304.
16384 1697. 1716. 1714. 1716. 1720.
32768 1118. 1171. 1172. 1173. 1378.
65536 953.4 990.1 992.3 992.7 1222.
131072 598.2 630.9 633.2 633.4 825.5
262144 446.6 459.2 460.4 460.4 572.2
524288 440.2 447.8 448.1 448.1 496.2
1048576 431.0 439.9 440.0 440.0 454.2
STREAM results for the same machine (these results could probably be
improved by more sophisticated placement of the arrays):
1 CPU base result (C 4.2 compiled)
Function Rate (MB/s) RMS time Min time Max time
Assignment: 355.8263 0.1428 0.1415 0.1474
Scaling : 344.8744 0.1463 0.1459 0.1480
Summing : 364.6393 0.2083 0.2070 0.2126
SAXPYing : 350.3134 0.2159 0.2155 0.2174
1 CPU result (C 4.2 compiled with compiler generated prefetch)
Function Rate (MB/s) RMS time Min time Max time
Assignment: 348.7358 0.1391 0.1376 0.1533
Scaling : 354.9112 0.1362 0.1352 0.1401
Summing : 407.6570 0.1779 0.1766 0.1792
SAXPYing : 411.1044 0.1764 0.1751 0.1770
2 CPU result (C 4.2 compiled with compiler generated prefetch)
Function Rate (MB/s) RMS time Min time Max time
Assignment: 480.4420 0.1014 0.0999 0.1244
Scaling : 484.4671 0.1007 0.0991 0.1242
Summing : 525.6779 0.1388 0.1370 0.1711
SAXPYing : 561.3117 0.1288 0.1283 0.1298
1 CPU (C 4.2 based, experimental with VIS/assembly, no prefetch)
Function Rate (MB/s) RMS time Min time Max time
Assignment: 617.6496 0.0780 0.0777 0.0792
Scaling : 458.0195 0.1060 0.1048 0.1123
Summing : 511.3926 0.1414 0.1408 0.1433
SAXPYing : 496.0045 0.1464 0.1452 0.1489
2 CPU (C 4.2 based, experimental with VIS/assembly, no prefetch)
Function Rate (MB/s) RMS time Min time Max time
Assignment: 741.3924 0.0651 0.0647 0.0665
Scaling : 685.2633 0.0703 0.0700 0.0710
Summing : 734.5970 0.0982 0.0980 0.0985
SAXPYing : 726.9866 0.0992 0.0990 0.0995
While the prefetch isn't very effective at improving the numbers per
se, it helps tremendously with badly placed data.
Achim Gratz.
--+<[ It's the small pleasures that make life so miserable. ]>+--
WWW: http://www.inf.tu-dresden.de/~ag7/{english/}
E-Mail: gratz@ite.inf.tu-dresden.de
Phone: +49 351 463 - 8325
This archive was generated by hypermail 2b29 : Tue Apr 18 2000 - 05:23:08 CDT