(A copy of this message has also been posted to the following newsgroups:
comp.sys.super, comp.arch, comp.benchmarks, comp.sys.sun.misc)
In article <5g9mqi$kif@murrow.corp.sgi.com>, mccalpin@asd.sgi.com wrote:
>
> I have some numbers for the UE 10000 (64-cpu only), but my
> understanding is that they were preliminary, so I was waiting
> for the rest of the numbers before putting them in the table.
>
> I guess I should follow up on this and find out if I misunderstood
> the intent of the message I received from Sun. I was certainly
> hoping to get some numbers from smaller processor counts on the
> UE10000 as well.
>
> Since the numbers were posted to USENET, I will repeat them here:
>
> omitted
> --
> John D. McCalpin, Ph.D. Supercomputing Performance Analyst
> Scalable Systems Group http://reality.sgi.com/employees/mccalpin
> Silicon Graphics, Inc. mccalpin@sgi.com 415-933-7407
Sorry, John, for not getting these out to the public sooner. Here are the
Starfire Stream results that I ran at the end of January.
1. Auto-parallel C Stream bandwidth
          Copy     Scale      Vadd     Triad
Cpus      MBps      MBps      MBps      MBps
1          164       164       202       202
8        1,271     1,270     1,544     1,546
16       2,371     2,414     2,942     2,905
24       3,568     3,577     4,292     4,305
32       4,397     4,408     5,166     5,188
40       5,317     5,374     6,162     6,222
48       5,961     6,056     6,861     6,914
56       6,183     6,304     7,131     7,128
63       6,307     6,391     7,203     7,197
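For readers who have not seen the benchmark, here is a minimal sketch of the four Stream kernels and of how the reported MBps figure is counted. It follows the standard Stream definitions, not the exact source or compiler options used for the runs above (those are not shown in this post); the function names are made up for illustration, and the auto-parallelization the compiler applied is not represented.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative names only; not taken from the code used for these runs. */

/* Copy: one load stream, one store stream. */
void stream_copy(double *c, const double *a, size_t n)
{
    for (size_t j = 0; j < n; j++)
        c[j] = a[j];
}

/* Scale: one load stream, one store stream. */
void stream_scale(double *b, const double *c, double s, size_t n)
{
    for (size_t j = 0; j < n; j++)
        b[j] = s * c[j];
}

/* Vadd (called "Add" in Stream itself): two load streams, one store stream. */
void stream_vadd(double *c, const double *a, const double *b, size_t n)
{
    for (size_t j = 0; j < n; j++)
        c[j] = a[j] + b[j];
}

/* Triad: two load streams, one store stream. */
void stream_triad(double *a, const double *b, const double *c, double s,
                  size_t n)
{
    for (size_t j = 0; j < n; j++)
        a[j] = b[j] + s * c[j];
}

/* Bandwidth in MBps (10^6 bytes per second) the way Stream counts it:
 * arrays_touched is 2 for copy and scale, 3 for vadd and triad. */
double stream_mbps(size_t n, int arrays_touched, double seconds)
{
    assert(seconds > 0.0);
    return 1.0e-6 * (double)arrays_touched * sizeof(double) * (double)n
           / seconds;
}
```

Note that the count only includes the loads and stores the source code asks for; any extra traffic the hardware generates on stores is not in these numbers.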
2. Auto-parallel C total interconnect bandwidth
These are the Table 1 numbers, multiplied by 3/2 for copy
and scale, and by 4/3 for vadd and triad -- to account for
write-allocate traffic on the interconnect (each cache line
that is stored to is first read from memory). They are
useful to compare against the peak bandwidth of 10,667 MBps.
          Copy     Scale      Vadd     Triad
Cpus      MBps      MBps      MBps      MBps
1          246       246       269       269
8        1,907     1,905     2,059     2,062
16       3,557     3,620     3,922     3,873
24       5,353     5,366     5,722     5,740
32       6,595     6,612     6,888     6,917
40       7,976     8,062     8,215     8,296
48       8,942     9,083     9,148     9,219
56       9,274     9,456     9,508     9,505
63       9,461     9,586     9,604     9,596
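The 3/2 and 4/3 factors can be derived from the number of load and store streams in each kernel. The sketch below assumes that every store stream costs one extra read of the same size (the write-allocate fill); the helper names are hypothetical and are not part of Stream or of the code used for these runs.

```c
#include <assert.h>

/* Hypothetical helper: ratio of interconnect traffic to Stream-counted
 * traffic when each store stream also triggers a write-allocate read. */
double write_allocate_factor(int loads, int stores)
{
    assert(loads >= 0 && stores >= 0 && loads + stores > 0);
    int counted = loads + stores;      /* streams Stream counts            */
    int actual  = loads + 2 * stores;  /* each store also fetches its line */
    return (double)actual / (double)counted;
}

/* Hypothetical helper: convert a Stream-reported figure (MBps) into
 * total interconnect traffic (MBps). */
double interconnect_mbps(double stream_mbps, int loads, int stores)
{
    return stream_mbps * write_allocate_factor(loads, stores);
}
```

Copy and scale have one load and one store stream, giving 3/2; vadd and triad have two loads and one store, giving 4/3. For example, `interconnect_mbps(164.0, 1, 1)` gives 246, the 1-cpu copy entry. Some entries recomputed from the rounded Table 1 values differ from Table 2 by 1 MBps, presumably because Table 2 was computed from unrounded measurements.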
3. VIS assembler "experimental" Stream bandwidth
The SPARC Visual Instruction Set (VIS) includes
block load and store instructions, which move data between a
64-byte-aligned block of memory and eight floating-point
registers. Because an entire cache block is accessed, no
extra write-allocate traffic is necessary on the interconnect.
Compared to Table 2, my VIS assembler loops
keep a bit more total interconnect traffic
outstanding than the stock C code did.
          Copy     Scale      Vadd     Triad
Cpus      MBps      MBps      MBps      MBps
1          325       322       288       263
8        2,499     2,491     2,252     2,099
16       4,527     4,669     4,243     3,944
24       6,720     6,759     6,156     5,860
32       7,872     7,987     7,377     7,092
40       9,277     9,355     8,877     8,594
48       9,938     9,917     9,618     9,373
56      10,250    10,175    10,030     9,910
63      10,307    10,180    10,181    10,107
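To put the Table 3 numbers next to the Table 2 numbers and the 10,667 MBps peak, the arithmetic can be sketched as below. The helper names are hypothetical; the only inputs are the peak figure and the table entries quoted in this post.

```c
#include <assert.h>

/* Peak interconnect bandwidth quoted in this post, in MBps. */
#define PEAK_MBPS 10667.0

/* Hypothetical helper: percentage of the quoted peak that was reached. */
double percent_of_peak(double mbps)
{
    assert(mbps >= 0.0);
    return 100.0 * mbps / PEAK_MBPS;
}

/* Hypothetical helper: how much more interconnect traffic the VIS loop
 * sustains than the C loop, as a percentage of the C figure. */
double vis_gain_percent(double vis_mbps, double c_interconnect_mbps)
{
    assert(c_interconnect_mbps > 0.0);
    return 100.0 * (vis_mbps - c_interconnect_mbps) / c_interconnect_mbps;
}
```

For the 63-cpu copy entries, `percent_of_peak(10307.0)` is roughly 96.6 and `percent_of_peak(9461.0)` is roughly 88.7, and `vis_gain_percent(10307.0, 9461.0)` is roughly 8.9.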