In article <m4dh1bINNnee@exodus.Eng.Sun.COM>, tremblay@flayout.Eng.Sun.COM (Marc Tremblay) writes:
> In article <1993Jul16.104143.27476@is.titech.ac.jp> maeno@is.titech.ac.jp (Toshinori Maeno) writes:
> > 2. read miss penalty is 12 cycles for read, 40 cycles for write when
> >the data is only in the memory.
>
> Maybe someone from DEC can explain the discrepancy between the 12 cycles
> claimed here and the 27 cycles that was claimed in a previous message
> for a second level read miss.
>
> - Marc Tremblay.
> Sun Microsystems.
That's easy. The 12-cycle number is wrong. Dileep's message was accurate,
but it is pretty easy to draw wrong conclusions from the numbers, and quite
difficult to actually measure them.
What the hardware does (read):
	 1 cycle	first-level cache access
	 5 cycles	second-level cache access
	27 cycles	main memory access
What the programmer sees:
The program keeps running after the LD is issued, and stalls only
if the destination register of the LD is touched before the data
gets there. Consequently, in order to measure the load latency,
you have to do something like the following:
1) Perform a series of references to ensure that the test reference
will be a hit or miss in the appropriate cache.
2) Do an MB to make sure the write buffers are flushed.
3) Let the pin bus go idle, to ensure your test reference is
not stalled behind some other activity.
4) Execute a test code sequence, consisting (typically) of:
	RCC	; read cycle counter
	LD	; test load instruction
	ADD	; some instruction to touch the result register
	RCC	; read the cycle counter again
Of course it isn't that simple, since you have to know the alignment
of the instructions, and which ones will dual-issue with which.
5) Correct the result of the measurement for the effects of the
RCC instructions.
In fact, the 21064 can have two outstanding loads. A third load will stall,
and there are some other weird stall conditions; see the 21064 data book.
The answers given by this measurement are (I think) 3-cycle latency for
the first level cache (load-use penalty) and 8 for the external cache,
because it takes a couple of cycles to get the pipes moving and to get
the address to the pins, before you can start the cache access.
The rep-rates are 1 cycle for the internal cache and 5 for the external.
Measuring store latency is even harder. First you have to decide
what it means, since the 21064 will store up to 128 bytes of write data
in the write buffers before incurring ANY delays to the program.
One thing it might mean is how long it takes before the cells in
the DRAMs get new values. Who cares?
Another thing it might mean is how long it takes to write a value
and then read it back. (Pushing arguments and calling a procedure,
which pops them, might have performance limited by such a path.)
I <think> on the 3000/400 this path requires that the write data reach
the external cache before it can be read back in via a second-level cache hit.
The performance of this path is on the order of 20 cycles, and can be
measured using the cycle counter as above. Of course a good compiler
might pass arguments in registers, since reading back something you've
just written can be expensive on many modern machines.
Writes to the secondary cache take longer than reads because the chip
must make two cache accesses to do a write. The first access is a read
to check the tag store and dirty bits. The second access writes the data
and updates the dirty bit. If the relevant write buffer entry contains
data from both halves of the 32 byte cache line, then the write will take
three 5-cycle cache accesses, because the pin bus is only 16 bytes wide.
So my view is that asking "what are the read and write latencies" is
interesting, but can be simplistic, since you cannot use the answers
to predict anything in particular. You cannot add read-latencies to
calculate bandwidth, because the memory is delivering 32 bytes per access,
not just the bytes you asked for. In any case, the memory rep-rate is NOT
the same as the latency. The situation for writes is clearer: write latency
is very nearly uninteresting, since it has almost nothing to do with
performance. Write bandwidth is interesting, and the fact that write
activity limits the bandwidth available for reads is interesting, but
who cares how long a write takes? (Other than a programmed-I/O device driver.)
-Larry Stewart
-- Digital Equipment Corporation Cambridge Research Laboratory