STREAM2 measures sustained bandwidth at all levels of the cache hierarchy and exposes the performance differences between reads and writes more clearly than STREAM does.
STREAM2 is based on the same ideas as STREAM, but uses a different
set of vector kernels:
FILL: similar to bzero(), but fills with a constant instead of zero
COPY: similar to bcopy(), and the same as STREAM Copy
DAXPY: similar to STREAM Triad, but overwrites one of the input vectors instead of writing results to a third vector
SUM: sum reduction on a single vector -- reads only, no writes
Kernel | Code | Bytes/iter read | | |
FILL | a(i) = q | | | |
COPY | a(i) = b(i) | | | |
DAXPY | a(i) = a(i) + q*b(i) | | | |
SUM | sum = sum + a(i) | | | |
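The four kernels can be sketched in C as follows. This is an illustrative translation, not the STREAM2 source: the function names and the `kernel_traffic` table are hypothetical. The traffic figures are derived only from the array references in each loop, assuming 8-byte `double` elements and no extra traffic from the cache's write policy. On a cache that allocates lines on write, FILL and COPY may also read the target line before overwriting it.

```c
#include <stddef.h>

/* FILL: a(i) = q. Stores only, no loads from memory. */
void stream2_fill(double *a, double q, size_t n) {
    for (size_t i = 0; i < n; i++) {
        a[i] = q;
    }
}

/* COPY: a(i) = b(i). One load and one store per iteration. */
void stream2_copy(double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i];
    }
}

/* DAXPY: a(i) = a(i) + q*b(i). Two loads, one store, overwriting a. */
void stream2_daxpy(double *a, const double *b, double q, size_t n) {
    for (size_t i = 0; i < n; i++) {
        a[i] = a[i] + q * b[i];
    }
}

/* SUM: sum = sum + a(i). One load per iteration, no stores to memory. */
double stream2_sum(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum = sum + a[i];
    }
    return sum;
}

/* Nominal bytes moved per iteration, counted from the loop bodies above.
   Assumption: only explicit array references are counted. */
typedef struct {
    const char *name;
    size_t bytes_read;
    size_t bytes_written;
} kernel_traffic;

static const kernel_traffic STREAM2_TRAFFIC[] = {
    { "FILL",  0,                  sizeof(double) },
    { "COPY",  sizeof(double),     sizeof(double) },
    { "DAXPY", 2 * sizeof(double), sizeof(double) },
    { "SUM",   sizeof(double),     0              },
};
```

Taken together, FILL isolates writes and SUM isolates reads, which is why this kernel set can separate read and write performance.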
The main feature is that the same amount of work is done for each vector
length, so the shorter vector lengths are iterated many times and the longer
vector lengths fewer times.
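One way to hold the work constant is to fix a total element budget and divide it by the vector length. The sketch below assumes that approach. The helper name `stream2_reps` and the `total_elems` parameter are hypothetical, and the actual benchmark may choose its lengths and repetition counts differently.

```c
#include <stddef.h>

/* Hypothetical helper: number of passes over a vector of length n so that
   each vector length processes roughly total_elems elements in total.
   Always performs at least one pass for a non-empty vector. */
size_t stream2_reps(size_t total_elems, size_t n) {
    if (n == 0) {
        return 0;               /* nothing to iterate over */
    }
    size_t reps = total_elems / n;
    return reps > 0 ? reps : 1; /* long vectors still get one pass */
}
```

With a budget of 1,000,000 elements, a 100-element vector would be traversed 10,000 times and a 1,000,000-element vector once.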
Machine | CPU | MHz | L1 Data Cache | L2 Data Cache | Peak L2 Cache Bandwidth | Bus Width @ Speed | Peak Memory Bandwidth |
IBM RS/6000-397 | POWER2-SC | 160 | 128kB @ 160 MHz | none | N/A | 256 bits @ 80 MHz | 2560 MB/s |
Upgraded Mac clone | PowerPC G3 | 367.5 | 32kB @ 367.5 MHz | 512 kB @ 183.75 MHz | 2940 MB/s | 64 bits @ 52.5 MHz | 420 MB/s |
PowerComputing PowerCurve 601/120 | PowerPC 601 | 120 | 64kB (I+D) @ 120 MHz | 256kB @ 40 MHz | 320 MB/s | 64 bits @ 40 MHz | 320 MB/s |
Mac Quadra 650 | Motorola 68040 | 33 | 8 kB @ 33 MHz | none | N/A | 32 bits @ 33 MHz | 132 MB/s |
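The peak memory bandwidth column is consistent with multiplying the bus width in bytes by the bus clock. The sketch below reproduces that arithmetic, assuming one transfer per bus clock and 1 MB = 10^6 bytes. The function name `peak_mb_per_s` is illustrative.

```c
/* Peak bandwidth in MB/s for a bus of the given width and clock.
   Assumptions: one transfer per clock cycle, 1 MB = 10^6 bytes. */
double peak_mb_per_s(unsigned bus_width_bits, double clock_mhz) {
    return (bus_width_bits / 8.0) * clock_mhz;
}
```

For example, 256 bits at 80 MHz is 32 bytes x 80 MHz = 2560 MB/s, matching the RS/6000 row, and 32 bits at 33 MHz gives the Quadra's 132 MB/s.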