-- John D. McCalpin, Ph.D. mccalpin@austin.ibm.com Senior Scientist IBM POWER Microprocessor Development "I am willing to make mistakes as long as someone else is willing to learn from them."
attached mail follows:
Hi John,
Two problems: (1) Your web site currently has two different values for the DS10 and DS10L. This should not be true. (2) Furthermore, the values I've measured with the latest rev DS10 do not match EITHER of the two values you have posted.
These confusing results stimulated me to do a massive number of experiments - on various different combinations of system, firmware, software, memory config, etc.
I believe the attached is a more accurate representation of the currently shipping product than either of the existing postings. If you are willing, I'd suggest calling it:
Compaq_AlphaServer_DS10/DS10L
or if space does not permit the above, then how about
Compaq_Alpha_DS10/DS10L
Please *remove* the result that I submitted 5/26/99, http://www.cs.virginia.edu/stream/stream_mail/1999/0026.html . Or if you don't like removing old history, you could move it to the experimental table or somehow rename it with something like "older rev" or "obsolete firmware".
The other existing result is from Greg Lindahl. Greg wrote in http://www.cs.virginia.edu/stream/stream_mail/2000/0008.html that he thought that his slower result was due to hardware. Actually, it's probably due to the fact that he was using Linux, and the compiler version he had may have been slightly less aggressive in its use of prefetching. It's not hardware; I have it from an authoritative member of the engineering team that the DS10 and DS10L memory systems are the same, in all aspects except capacity. The former can hold 2GB; the latter can hold only 1GB. So perhaps you could mark Greg's result as something like
Compaq_Alpha_DS10_Linux
You'll note that the source below is identical to the source for the recent ES40 submissions; I just compiled it WITHOUT "-omp", so those parallel directives are treated as comments.
Thanks - John Henning
Script started on Tue Jun 27 05:00:51 2000
% /usr/sbin/psrinfo -v
Status of processor 0 as of: 06/27/00 05:00:55
  Processor has been on-line since 06/21/2000 11:15:15
  The alpha EV6 (21264) processor operates at 463 MHz,
  and has an alpha internal floating point processor.
% diff stream_d.f_as_at_ftp_site_22may97 mcc_omp.f
0a1,4
> * this version 25-May-2000 j. henning compaq
> * Arrays are ALLOCATABLE; OpenMP syntax has been simplified (per
> * suggestion from John McCalpin); timer is Fortran-90 SYSTEM_CLOCK
>
52,53c56,57
<       INTEGER n,offset,ndim,ntimes
<       PARAMETER (n=2000000,offset=0,ndim=n+offset,ntimes=10)
---
>       INTEGER*8 n,offset,ndim,ntimes,maxtimes
>       PARAMETER (maxtimes=10000)
57c61
<       INTEGER j,k,nbpw,quantum
---
>       INTEGER*8 j,k,nbpw,quantum
60,62c64,66
<       DOUBLE PRECISION maxtime(4),mintime(4),rmstime(4),sum(3),
<      $                 times(4,ntimes)
<       INTEGER bytes(4)
---
>       DOUBLE PRECISION maxtime(4),mintime(4),rmstime(4),
>      $                 sum1,sum2,sum3,times(4,maxtimes),avgbw
>       INTEGER*8 bytes(4)
75c79
<       DOUBLE PRECISION a(ndim),b(ndim),c(ndim)
---
>       REAL*8, ALLOCATABLE:: a(:),b(:),c(:)
88a93,98
>       PRINT *, "n, offset, ntimes"
>       READ *, n, offset, ntimes
>       ndim=n+offset
>       IF (ntimes .GT. maxtimes) ntimes=maxtimes
>       ALLOCATE (a(ndim), b(ndim), c(ndim))
>       CALL defend_wrap
94c104
<      $  3*nbpw*n/ (1024*1024),' MB'
---
>      $  3D0*nbpw*DBLE(n)/ (1024d0*1024),' MB'
97a108
> c$omp parallel do
103a115
> c$omp parallel do
128a141
> c$omp parallel do
135a149
> c$omp parallel do
142a157
> c$omp parallel do
149a165
> c$omp parallel do
165a182
>       avgbw = 0
169a187
>           avgbw = avgbw+n*bytes(j)*nbpw/mintime(j)/1.0D6
171,173c189,193
<       sum(1) = 0.0d0
<       sum(2) = 0.0d0
<       sum(3) = 0.0d0
---
>       WRITE (*,FMT=9050) "AvgBW: ", avgbw/4D0
>       sum1 = 0.0d0
>       sum2 = 0.0d0
>       sum3 = 0.0d0
> c$omp parallel do
175,177c195,197
<           sum(1) = sum(1) + a(j)
<           sum(2) = sum(2) + b(j)
<           sum(3) = sum(3) + c(j)
---
>           sum1 = sum1 + a(j)
>           sum2 = sum2 + b(j)
>           sum3 = sum3 + c(j)
179,181c199,201
<       PRINT *,'Sum of a is = ',sum(1)
<       PRINT *,'Sum of b is = ',sum(2)
<       PRINT *,'Sum of c is = ',sum(3)
---
>       PRINT *,'Sum of a is = ',sum1
>       PRINT *,'Sum of b is = ',sum2
>       PRINT *,'Sum of c is = ',sum3
185c205
<  9020 FORMAT (1x,a,i4,a)
---
>  9020 FORMAT (1x,a,f10.2,a)
334a355,374
>       END
>
>       SUBROUTINE defend_wrap
>       INTEGER count, count_rate, count_max
>       CALL SYSTEM_CLOCK ( count, count_rate, count_max )
>       IF (DBLE(count) .GT. .999*DBLE(count_max)) THEN
>          PRINT *,"Oops, this code won't handle a wrapping system_clock"
>          PRINT *,"and soon we will wrap."
>          PRINT 4, "count:", count, "count_max:", count_max
>     4    FORMAT (1X, A10, I16)
>          PRINT *,"Try again later, or fix the code to handle wraps."
>          PRINT *,"(The counter wraps approx once every 60 hours)"
>          STOP
>       END IF
>       END
>
>       DOUBLE PRECISION FUNCTION second
>       INTEGER count, count_rate, count_max
>       CALL SYSTEM_CLOCK ( count, count_rate, count_max )
>       second = DBLE(count)/DBLE(count_rate)
% cat !$
% cat mcc_omp.f
* this version 25-May-2000 j. henning compaq
* Arrays are ALLOCATABLE; OpenMP syntax has been simplified (per
* suggestion from John McCalpin); timer is Fortran-90 SYSTEM_CLOCK

* Program: Stream
* Programmer: John D. McCalpin
* Revision: 4.1, June 4, 1996
*
* This program measures memory transfer rates in MB/s for simple
* computational kernels coded in Fortran.  These numbers reveal the
* quality of code generation for simple uncacheable kernels as well
* as showing the cost of floating-point operations relative to memory
* accesses.
*
*=========================================================================
* INSTRUCTIONS:
*       1) Stream requires a cpu timing function called second().
*          A sample is shown below.  This is unfortunately rather
*          system dependent.  The code attempts to determine the
*          granularity of the clock to help interpret the results.
*          For dedicated or parallel runs, you might want to comment
*          these out and compile/link with "wallclock.c".
*       2) Stream requires a good bit of memory to run.
*          Adjust the Parameter 'N' in the main program to give
*          a 'timing calibration' of at least 20 clicks.
*          This will provide rate estimates that should be good to
*          about 5% precision.
*         ------------------------------------------------------------
*         Note that you are free to use any array length and offset
*         that makes each array larger than the last-level cache.
*         The intent is to determine the *best* sustainable bandwidth
*         available with this simple coding.  Of course, lower values
*         are usually fairly easy to obtain on cached machines, but
*         by keeping the test to the *best* results, the answers are
*         easier to interpret.
*         You may put the arrays in common or not, at your discretion.
*         There is a commented-out COMMON statement below.
*         ------------------------------------------------------------
*       3) Compile the code with full optimization.  Many compilers
*          generate unreasonably bad code before the optimizer tightens
*          things up.  If the results are unreasonably good, on the
*          other hand, the optimizer might be too smart for me
*          Please let me know if this happens.
*       4) Mail the results to mccalpin@cs.virginia.edu
*          Be sure to include:
*               a) computer hardware model number and software revision
*               b) the compiler flags
*               c) all of the output from the test case.
*
* Thanks
*=========================================================================
*
      PROGRAM stream
*     IMPLICIT NONE
C     .. Parameters ..
      INTEGER*8 n,offset,ndim,ntimes,maxtimes
      PARAMETER (maxtimes=10000)
C     ..
C     .. Local Scalars ..
      DOUBLE PRECISION dummy,scalar,t
      INTEGER*8 j,k,nbpw,quantum
C     ..
C     .. Local Arrays ..
      DOUBLE PRECISION maxtime(4),mintime(4),rmstime(4),
     $                 sum1,sum2,sum3,times(4,maxtimes),avgbw
      INTEGER*8 bytes(4)
      CHARACTER label(4)*11
C     ..
C     .. External Functions ..
      DOUBLE PRECISION second
      INTEGER checktick,realsize
      EXTERNAL second,checktick,realsize
C     ..
C     .. Intrinsic Functions ..
C
      INTRINSIC dble,max,min,nint,sqrt
C     ..
C     .. Arrays in Common ..
      REAL*8, ALLOCATABLE:: a(:),b(:),c(:)
C     ..
C     .. Common blocks ..
*     COMMON a,b,c
C     ..
C     .. Data statements ..
      DATA rmstime/4*0.0D0/,mintime/4*1.0D+36/,maxtime/4*0.0D0/
      DATA label/'Copy: ','Scale: ','Add: ',
     $     'Triad: '/
      DATA bytes/2,2,3,3/,dummy/0.0d0/
C     ..
* --- SETUP --- determine precision and check timing ---
      PRINT *, "n, offset, ntimes"
      READ *, n, offset, ntimes
      ndim=n+offset
      IF (ntimes .GT. maxtimes) ntimes=maxtimes
      ALLOCATE (a(ndim), b(ndim), c(ndim))
      CALL defend_wrap
      nbpw = realsize()

      WRITE (*,FMT=9010) 'Array size = ',n
      WRITE (*,FMT=9010) 'Offset = ',offset
      WRITE (*,FMT=9020) 'The total memory requirement is ',
     $  3D0*nbpw*DBLE(n)/ (1024d0*1024),' MB'
      WRITE (*,FMT=9030) 'You are running each test ',ntimes,' times'
      WRITE (*,FMT=9030) 'The *best* time for each test is used'

c$omp parallel do
      DO 10 j = 1,n
          a(j) = 1.0d0
          b(j) = 2.0D0
          c(j) = 0.0D0
   10 CONTINUE
      t = second(dummy)
c$omp parallel do
      DO 20 j = 1,n
          a(j) = 2.0d0*a(j)
   20 CONTINUE
      t = second(dummy) - t
      PRINT *,'----------------------------------------------------'
      quantum = checktick()
      WRITE (*,FMT=9000)
     $  'Your clock granularity/precision appears to be ',quantum,
     $  ' microseconds'
      PRINT *,'The tests below will each take a time on the order '
      PRINT *,'of ',nint(t*1d6),' microseconds'
      PRINT *,' (= ',nint((t*1d6)/quantum),' clock ticks)'
      PRINT *,'Increase the size of the arrays if this shows that'
      PRINT *,'you are not getting at least 20 clock ticks per test.'
      PRINT *,'----------------------------------------------------'
      PRINT *,'WARNING -- The above is only a rough guideline.'
      PRINT *,'For best results, please be sure you know the'
      PRINT *,'precision of your system timer.'
      PRINT *,'----------------------------------------------------'
*       --- MAIN LOOP --- repeat test cases NTIMES times ---
      scalar = 1.5d0*a(1)
      DO 70 k = 1,ntimes

          t = second(dummy)
c$omp parallel do
          DO 30 j = 1,n
              c(j) = a(j)
   30     CONTINUE
          t = second(dummy) - t
          times(1,k) = t

          t = second(dummy)
c$omp parallel do
          DO 40 j = 1,n
              b(j) = scalar*c(j)
   40     CONTINUE
          t = second(dummy) - t
          times(2,k) = t

          t = second(dummy)
c$omp parallel do
          DO 50 j = 1,n
              c(j) = a(j) + b(j)
   50     CONTINUE
          t = second(dummy) - t
          times(3,k) = t

          t = second(dummy)
c$omp parallel do
          DO 60 j = 1,n
              a(j) = b(j) + scalar*c(j)
   60     CONTINUE
          t = second(dummy) - t
          times(4,k) = t
   70 CONTINUE
*       --- SUMMARY ---
      DO 90 k = 1,ntimes
          DO 80 j = 1,4
              rmstime(j) = rmstime(j) + times(j,k)**2
              mintime(j) = min(mintime(j),times(j,k))
              maxtime(j) = max(maxtime(j),times(j,k))
   80     CONTINUE
   90 CONTINUE
      WRITE (*,FMT=9040)
      avgbw = 0
      DO 100 j = 1,4
          rmstime(j) = sqrt(rmstime(j)/dble(ntimes))
          WRITE (*,FMT=9050) label(j),n*bytes(j)*nbpw/mintime(j)/1.0D6,
     $      rmstime(j),mintime(j),maxtime(j)
          avgbw = avgbw+n*bytes(j)*nbpw/mintime(j)/1.0D6
  100 CONTINUE
      WRITE (*,FMT=9050) "AvgBW: ", avgbw/4D0
      sum1 = 0.0d0
      sum2 = 0.0d0
      sum3 = 0.0d0
c$omp parallel do
      DO 110 j = 1,n
          sum1 = sum1 + a(j)
          sum2 = sum2 + b(j)
          sum3 = sum3 + c(j)
  110 CONTINUE
      PRINT *,'Sum of a is = ',sum1
      PRINT *,'Sum of b is = ',sum2
      PRINT *,'Sum of c is = ',sum3

 9000 FORMAT (1x,a,i6,a)
 9010 FORMAT (1x,a,i10)
 9020 FORMAT (1x,a,f10.2,a)
 9030 FORMAT (1x,a,i3,a,a)
 9040 FORMAT ('Function',5x,'Rate (MB/s) RMS time Min time Max time'
     $       )
 9050 FORMAT (a,4 (f10.4,2x))
      END
*-------------------------------------
* INTEGER FUNCTION dblesize()
*
* A semi-portable way to determine the precision of DOUBLE PRECISION
* in Fortran.
* Here used to guess how many bytes of storage a DOUBLE PRECISION
* number occupies.
*
      INTEGER FUNCTION realsize()
*     IMPLICIT NONE

C     .. Local Scalars ..
      DOUBLE PRECISION result,test
      INTEGER j,ndigits
C     ..
C     .. Local Arrays ..
      DOUBLE PRECISION ref(30)
C     ..
C     .. External Subroutines ..
      EXTERNAL confuse
C     ..
C     .. Intrinsic Functions ..
      INTRINSIC abs,acos,log10,sqrt
C     ..

C       Test #1 - compare single(1.0d0+delta) to 1.0d0

   10 DO 20 j = 1,30
          ref(j) = 1.0d0 + 10.0d0** (-j)
   20 CONTINUE

      DO 30 j = 1,30
          test = ref(j)
          ndigits = j
          CALL confuse(test,result)
          IF (test.EQ.1.0D0) THEN
              GO TO 40
          END IF
   30 CONTINUE
      GO TO 50

   40 WRITE (*,FMT='(a)')
     $  '----------------------------------------------'
      WRITE (*,FMT='(1x,a,i2,a)') 'Double precision appears to have ',
     $  ndigits,' digits of accuracy'
      IF (ndigits.LE.8) THEN
          realsize = 4
      ELSE
          realsize = 8
      END IF
      WRITE (*,FMT='(1x,a,i1,a)') 'Assuming ',realsize,
     $  ' bytes per DOUBLE PRECISION word'
      WRITE (*,FMT='(a)')
     $  '----------------------------------------------'
      RETURN

   50 PRINT *,'Hmmmm. I am unable to determine the size.'
      PRINT *,'Please enter the number of Bytes per DOUBLE PRECISION',
     $  ' number : '
      READ (*,FMT=*) realsize
      IF (realsize.NE.4 .AND. realsize.NE.8) THEN
          PRINT *,'Your answer ',realsize,' does not make sense.'
          PRINT *,'Try again.'
          PRINT *,'Please enter the number of Bytes per ',
     $      'DOUBLE PRECISION number : '
          READ (*,FMT=*) realsize
      END IF
      PRINT *,'You have manually entered a size of ',realsize,
     $  ' bytes per DOUBLE PRECISION number'
      WRITE (*,FMT='(a)')
     $  '----------------------------------------------'
      END

      SUBROUTINE confuse(q,r)
*     IMPLICIT NONE
C     .. Scalar Arguments ..
      DOUBLE PRECISION q,r
C     ..
C     .. Intrinsic Functions ..
      INTRINSIC cos
C     ..
      r = cos(q)
      RETURN
      END
* A semi-portable way to determine the clock granularity
* Adapted from a code by John Henning of Digital Equipment Corporation
*
      INTEGER FUNCTION checktick()
*     IMPLICIT NONE

C     .. Parameters ..
      INTEGER n
      PARAMETER (n=20)
C     ..
C     .. Local Scalars ..
      DOUBLE PRECISION dummy,t1,t2
      INTEGER i,j,jmin
C     ..
C     .. Local Arrays ..
      DOUBLE PRECISION timesfound(n)
C     ..
C     .. External Functions ..
      DOUBLE PRECISION second
      EXTERNAL second
C     ..
C     .. Intrinsic Functions ..
      INTRINSIC max,min,nint
C     ..
      i = 0
      dummy = 0.0d0
      t1 = second(dummy)

   10 t2 = second(dummy)
      IF (t2.EQ.t1) GO TO 10

      t1 = t2
      i = i + 1
      timesfound(i) = t1
      IF (i.LT.n) GO TO 10

      jmin = 1000000
      DO 20 i = 2,n
          j = nint((timesfound(i)-timesfound(i-1))*1d6)
          jmin = min(jmin,max(j,0))
   20 CONTINUE

      IF (jmin.GT.0) THEN
          checktick = jmin
      ELSE
          PRINT *,'Your clock granularity appears to be less ',
     $      'than one microsecond'
          checktick = 1
      END IF
      RETURN

*      PRINT 14, timesfound(1)*1d6
*      DO 20 i=2,n
*         PRINT 14, timesfound(i)*1d6,
*     &       nint((timesfound(i)-timesfound(i-1))*1d6)
*   14    FORMAT (1X, F18.4, 1X, i8)
*   20 CONTINUE

      END
      SUBROUTINE defend_wrap
      INTEGER count, count_rate, count_max
      CALL SYSTEM_CLOCK ( count, count_rate, count_max )
      IF (DBLE(count) .GT. .999*DBLE(count_max)) THEN
         PRINT *,"Oops, this code won't handle a wrapping system_clock"
         PRINT *,"and soon we will wrap."
         PRINT 4, "count:", count, "count_max:", count_max
    4    FORMAT (1X, A10, I16)
         PRINT *,"Try again later, or fix the code to handle wraps."
         PRINT *,"(The counter wraps approx once every 60 hours)"
         STOP
      END IF
      END
      DOUBLE PRECISION FUNCTION second
      INTEGER count, count_rate, count_max
      CALL SYSTEM_CLOCK ( count, count_rate, count_max )
      second = DBLE(count)/DBLE(count_rate)
      END
% cat buildit_noomp.csh
cat buildit_noomp.csh
#!/bin/csh
set verbose
unlimit
f90 -v -source_listing -machine_code \
    -o mcc_noomp_`date +%Y%m%d` \
    -fast -O5 -unroll 32 -arch ev6 \
    mcc_omp.f
grep COMPILER: mcc_omp.lis
%
% ./!$
% ./buildit_noomp.csh
unlimit
f90 -v -source_listing -machine_code -o mcc_noomp_`date +%Y%m%d` -fast -O5 -unroll 32 -arch ev6 mcc_omp.f
/usr/lib/cmplrs/fort90/decfort90 -machine_code -fast -O5 -unroll 32 -arch ev6 -I/usr/lib/cmplrs/hpfrtl -source_listing -o /tmp/forAAAabcuna.o mcc_omp.f
/usr/bin/cc -v -o mcc_noomp_20000627 -arch ev6 /usr/lib/cmplrs/fort90/for_main.o -source_listing /tmp/forAAAabcuna.o -O4 -qlshpf -lUfor -lfor -lFutil -lm -lots -lm_c32
/usr/lib/cmplrs/cc/ld -o mcc_noomp_20000627 -g0 -O4 -call_shared /usr/lib/cmplrs/cc/crt0.o /usr/lib/cmplrs/fort90/for_main.o /tmp/forAAAabcuna.o -qlshpf -lUfor -lfor -lFutil -lm -lots -lm_c32 -lc
/usr/lib/cmplrs/cc/ld:
0.01u 0.01s 0:00 40% 0+19k 0+12io 0pf+0w 19stk+2008mem
grep COMPILER: mcc_omp.lis
COMPILER: Compaq Fortran V5.3-915-449BB
%
% ls
buildit_noomp.csh     mcc_omp.lis
mcc_noomp_20000627    stream_d.f_as_at_ftp_site_22may97
mcc_omp.f             typescript
% ./mcc_noomp_20000627
n, offset, ntimes
1008075,0,10
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
Array size = 1008075
Offset = 0
The total memory requirement is 23.07 MB
You are running each test 10 times
The *best* time for each test is used
----------------------------------------------------
Your clock granularity/precision appears to be 100 microseconds
The tests below will each take a time on the order
of 18100 microseconds
(= 181 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
----------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
----------------------------------------------------
Function     Rate (MB/s)    RMS time    Min time    Max time
Copy:           806.4600      0.0202      0.0200      0.0208
Scale:          798.4752      0.0203      0.0202      0.0206
Add:            763.2114      0.0318      0.0317      0.0320
Triad:          777.9357      0.0313      0.0311      0.0317
AvgBW:          786.5206
Sum of a is = 1.162613685089136E+018
Sum of b is = 2.325227370154242E+017
Sum of c is = 3.100303160218152E+017
% exit
% script done on Tue Jun 27 05:01:42 2000
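As a cross-check on the numbers in the transcript, the reported rates can be recomputed from the "Min time" column with the same formula the summary loop uses, n*bytes(j)*nbpw/mintime(j)/1.0D6. The sketch below is in Python rather than Fortran and is only an illustration, not part of the submission. The minimum times are taken from the four-decimal output, so agreement is only to roughly that precision. The wrap-period estimate rests on two assumptions that the transcript does not state directly: a count_rate of 10000 ticks per second (inferred from the reported 100 microsecond granularity) and a count_max of 2**31 - 1 (a default 32-bit INTEGER).

```python
# Values copied from the run above: n = 1008075, 8 bytes per word.
N = 1008075
BYTES_PER_WORD = 8

# Words moved per loop iteration, as in DATA bytes/2,2,3,3/.
WORDS_PER_ITER = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# "Min time" column of the output, in seconds (four printed decimals).
MIN_TIME = {"Copy": 0.0200, "Scale": 0.0202, "Add": 0.0317, "Triad": 0.0311}


def rate_mb_per_s(kernel):
    """Same formula as the summary loop: n*bytes(j)*nbpw/mintime(j)/1.0D6."""
    bytes_moved = N * WORDS_PER_ITER[kernel] * BYTES_PER_WORD
    return bytes_moved / MIN_TIME[kernel] / 1.0e6


def total_memory_mb():
    """Three arrays of N words, in units of 1024*1024 bytes."""
    return 3.0 * BYTES_PER_WORD * N / (1024.0 * 1024.0)


def wrap_period_hours(count_max=2**31 - 1, count_rate=10_000):
    """Hours until SYSTEM_CLOCK wraps.

    Both defaults are assumptions: count_rate is inferred from the
    100 microsecond granularity, count_max from a 32-bit INTEGER.
    """
    return count_max / count_rate / 3600.0


if __name__ == "__main__":
    rates = {kernel: rate_mb_per_s(kernel) for kernel in WORDS_PER_ITER}
    for kernel, rate in rates.items():
        print(f"{kernel + ':':<8}{rate:12.4f} MB/s")
    print(f"{'AvgBW:':<8}{sum(rates.values()) / len(rates):12.4f} MB/s")
    print(f"{'Memory:':<8}{total_memory_mb():12.2f} MB")
    print(f"{'Wrap:':<8}{wrap_period_hours():12.2f} hours")
```

Under those assumptions the four rates, their average, and the 23.07 MB memory figure agree with the output above, and the wrap period comes out a little under 60 hours, consistent with the message in defend_wrap.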
This archive was generated by hypermail 2b29 : Mon Jul 17 2000 - 04:46:07 CDT