I'll include the 64-processor results when I send them in. They were a
smidge lower than the 63-processor numbers, and I wanted to post the
highest figures. The machine had no other users on it, but of course it's
still a live Unix system -- doing such nice things as sending me the
perfmeter info. So when you ask all 64 procs to make a parallel call
together, you get a tiny bit of delay, I suppose.
Standard
--------
63 procs 1024 K elements per proc 10 loop repeats 1512 MB memory
8 byte alignment, 0 byte offset
Auto-parallel C
Min RMS Max Max ---Total---- --Per Proc-- ----Time per----
time time time Min Stream Total Stream Total Load Store Mem
sec sec sec Range MB/s MB/s MB/s MB/s ns ns ns
Copy : 0.168 0.169 0.170 1% 6299 9449 100 150 640 1280 427
Scale: 0.166 0.167 0.168 2% 6385 9577 101 152 632 1263 421
Vadd : 0.221 0.221 0.222 0% 7171 9562 114 152 562 1687 422
Triad: 0.221 0.221 0.222 0% 7182 9576 114 152 561 1684 421
64 procs 1024 K elements per proc 10 loop repeats 1536 MB memory
8 byte alignment, 0 byte offset
Auto-parallel C
Min RMS Max Max ---Total---- --Per Proc-- ----Time per----
time time time Min Stream Total Stream Total Load Store Mem
sec sec sec Range MB/s MB/s MB/s MB/s ns ns ns
Copy : 0.173 0.174 0.174 1% 6202 9303 97 145 660 1321 440
Scale: 0.171 0.172 0.172 1% 6281 9422 98 147 652 1304 435
Vadd : 0.228 0.229 0.229 0% 7059 9412 110 147 580 1741 435
Triad: 0.227 0.228 0.228 0% 7087 9449 111 148 578 1734 433
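For anyone wanting to sanity-check the tables, here is a rough sketch of how
the MB/s columns appear to be derived (my assumption, based on the usual
STREAM accounting): "Stream" counts only the kernel's explicit loads and
stores, while "Total" adds the write-allocate read that each store miss
incurs on this machine. The results land within rounding of the printed
times:

```python
ELEM = 8            # bytes per double-precision element
N = 1024 * 1024     # elements per processor (1024 K)
PROCS = 63

def mbps(bytes_per_elem, time_s):
    """Aggregate bandwidth in decimal MB/s across all processors."""
    return bytes_per_elem * N * PROCS / time_s / 1e6

# Copy (a[i] = b[i]): 1 load + 1 store = 16 stream bytes/element;
# write-allocate adds one 8-byte read of the destination line.
copy_stream = mbps(2 * ELEM, 0.168)   # ~6291 MB/s (table: 6299)
copy_total  = mbps(3 * ELEM, 0.168)   # ~9437 MB/s (table: 9449)

# Vadd (a[i] = b[i] + c[i]): 2 loads + 1 store = 24 stream bytes/element.
vadd_stream = mbps(3 * ELEM, 0.221)   # ~7174 MB/s (table: 7171)
vadd_total  = mbps(4 * ELEM, 0.221)   # ~9565 MB/s (table: 9562)
```

The small discrepancies are consistent with the min times being rounded to
three decimal places in the report.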
Experimental
------------
63 procs 1024 K elements per proc 10 loop repeats 1512 MB memory
8192 byte alignment, 8192 byte offset
VIS
Min RMS Max Max ---Total---- --Per Proc-- ----Time per----
time time time Min Stream Total Stream Total Load Store Mem
sec sec sec Range MB/s MB/s MB/s MB/s ns ns ns
Copy : 0.103 0.105 0.110 7% 10279 10279 163 163 784 784 392
Scale: 0.103 0.105 0.110 7% 10258 10258 163 163 786 786 393
Vadd : 0.156 0.157 0.163 5% 10190 10190 162 162 594 1187 396
Triad: 0.157 0.158 0.164 5% 10119 10119 161 161 598 1195 398
64 procs 1024 K elements per proc 10 loop repeats 1536 MB memory
8192 byte alignment, 8192 byte offset
VIS
Min RMS Max Max ---Total---- --Per Proc-- ----Time per----
time time time Min Stream Total Stream Total Load Store Mem
sec sec sec Range MB/s MB/s MB/s MB/s ns ns ns
Copy : 0.107 0.108 0.114 7% 10042 10042 157 157 816 816 408
Scale: 0.107 0.108 0.108 1% 10011 10011 156 156 818 818 409
Vadd : 0.162 0.166 0.172 6% 9938 9938 155 155 618 1236 412
Triad: 0.162 0.163 0.164 1% 9917 9917 155 155 620 1239 413
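One thing worth noticing: in the Experimental (VIS) tables the Total and
Stream columns are equal, while in the Standard tables Total exceeds Stream.
My inference (an assumption, not something the report states) is that the
VIS code uses stores that bypass write-allocate, so no extra read of the
destination is counted. The expected Total/Stream ratios work out as:

```python
ELEM = 8  # bytes per double-precision element

def total_over_stream(loads, stores, allocate_reads):
    """Ratio of Total traffic to Stream traffic per element."""
    stream = (loads + stores) * ELEM
    total = stream + allocate_reads * ELEM
    return total / stream

# Standard: each stored element also costs a write-allocate read.
copy_ratio = total_over_stream(1, 1, 1)   # 1.5   (table: 9449/6299)
vadd_ratio = total_over_stream(2, 1, 1)   # 1.333 (table: 9562/7171)

# VIS: no allocate traffic, so Total == Stream.
vis_ratio = total_over_stream(1, 1, 0)    # 1.0
```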
>Thanks for the clarification -- my guesses were pretty close!
>
>It is interesting to note the use of 63 cpus. It can get
>difficult to spread the scheduler thin enough to not cause
>trouble when running 64-way jobs on a 64-cpu machine....
>--
>John D. McCalpin, Ph.D. Supercomputing Performance Analyst
>Scalable Systems Group http://reality.sgi.com/employees/mccalpin
>Silicon Graphics, Inc. mccalpin@sgi.com 415-933-7407
This archive was generated by hypermail 2b29 : Tue Apr 18 2000 - 05:23:06 CDT