[stream] The importance of using OFFSET in STREAM

From: David T. Wang (davewang@wam.umd.edu)
Date: Fri May 07 2004 - 14:10:12 CDT


The Problem:

I had previously submitted some STREAM bandwidth results for
a Dell PowerEdge 400SC. Recently, as I was going through some notes, I
realized that I had not "tuned" the software to perform optimally on the
hardware. Specifically, with OFFSET in STREAM set to 0, the array base
addresses were aligned such that the read and write streams of A[], B[],
and C[] all mapped to the same DRAM bank. It turns out that by picking
the correct offset, STREAM Add and Triad scores can be increased by
roughly 22% and 26%, respectively.

*********************************************************************

References:

Lin, Reinhardt, Burger, "Reducing DRAM latencies with an Integrated Memory
Hierarchy Design" HPCA 7, Jan 2001

Zhang, Zhu, Zhang, "Breaking Address Mapping Symmetry at Multi-levels of
Memory Hierarchy to Reduce DRAM Row-buffer Conflicts", Journal of
Instruction Level Parallelism, Vol 3, 2002.

Intel i875P chipset datasheet.

*********************************************************************

Optimization Effort:

For the 875P chipset, the bank ID is placed on physical address bits 14
and 15 for 256 Mbit DRAM chips. So all arrays larger than 2^16 bytes need
to be offset from one another by 2^14 bytes to get them mapped to
different banks. (This assumes that contiguous virtual addresses end up
physically contiguous, which, as it turns out, wasn't a bad assumption.)

So, for STREAM, the arrays need to be offset by 2^14 bytes. Since
sizeof(double) is 8 bytes, OFFSET needs to be 2^14 / 8 = 2^11 = 2048
elements in this system configuration.

**********************************************************************

System Configuration:

Dell PowerEdge 400SC 2.8GHz Pentium 4 in HT mode. Mandrake Linux
800 Mbps FSB, "dual channel" PC3200 DDR SDRAM peak BW = 6400 MB/s

2 Kingston KVR400X72C3A/512 DIMMs ECC 3-3-3 timing.

**********************************************************************

The Result:

wowbagger:~/stream: ./stream_nooffset
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 8388608, Offset = 0
Total memory required = 192.0 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity appears to be less than one microsecond.
Each test below will take on the order of 44954 microseconds.
(= -2147483648 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:           2331.2204     0.0583       0.0576       0.0593
Scale:          2473.4660     0.0547       0.0543       0.0552
Add:            2584.2938     0.0789       0.0779       0.0813
Triad:          2496.4570     0.0816       0.0806       0.0826
wowbagger:~/stream: vi stream_d.c
wowbagger:~/stream: cc -O stream_d.c second_wall.c -o stream_offset -lm
wowbagger:~/stream: ./stream_offset
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 8388608, Offset = 2048
Total memory required = 192.0 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity appears to be less than one microsecond.
Each test below will take on the order of 44153 microseconds.
(= -2147483648 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:           2448.4653     0.0553       0.0548       0.0563
Scale:          2474.4657     0.0549       0.0542       0.0559
Add:            3164.7217     0.0642       0.0636       0.0657
Triad:          3157.9160     0.0641       0.0638       0.0655

***************************************************************

Summary:

A simple OFFSET in this case placed the arrays in different DRAM banks.
STREAM Add bandwidth increased by about 580 MB/s (22%) and STREAM Triad
by about 660 MB/s (26%). Copy improved by about 5%, and Scale was
essentially unchanged.
