henning@perfom.zko.dec.com (John Henning) on 06/16/2000 04:53:01 PM
To: John D Mccalpin/Austin/IBM@IBMUS
Subject: 4cpu Compaq_AlphaServer_ES40_667
Hi John,
This is a 4-CPU AlphaServer ES40 6/667 (if that's too big a name,
you could call it Compaq_AlphaServer_ES40_667 or even [if pressed]
Compaq_Alpha_ES40_667)
You can see that I simplified the OpenMP syntax, per your suggestion.
I also added an arithmetic mean of the 4 loops; it was useful while
playing with it, and doesn't affect the timed portion.
Thanks
- John
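The arithmetic mean mentioned above can be cross-checked against the run log in this message. The following is a minimal sketch, not part of mcc_omp.f: the program name `chkavg` and the array `words` are made up for illustration, and the minimum times are the four-decimal values printed in the log rather than the unrounded timings. It applies the same formula as the summary loop (array length times arrays moved times bytes per word, divided by the minimum time and by 1.0D6), so the results should agree with the reported rates and AvgBW to within the rounding of the printed times.

```fortran
* Sketch only: recompute the STREAM rates and their arithmetic mean
* from the minimum times printed in the run log of this message.
* Program and variable names are illustrative, not from mcc_omp.f.
      PROGRAM chkavg
      IMPLICIT NONE
      INTEGER j
      DOUBLE PRECISION n,nbpw,rate,avgbw
      DOUBLE PRECISION words(4),mintime(4)
      CHARACTER label(4)*7
C     Arrays moved per iteration: Copy 2, Scale 2, Add 3, Triad 3
      DATA words/2.0d0,2.0d0,3.0d0,3.0d0/
C     Minimum times in seconds, as printed (rounded to 4 decimals)
      DATA mintime/0.0518d0,0.0542d0,0.0830d0,0.0811d0/
      DATA label/'Copy:  ','Scale: ','Add:   ','Triad: '/
C     Array length and bytes per word reported by the run
      n = 8005626.0d0
      nbpw = 8.0d0
      avgbw = 0.0d0
      DO 10 j = 1,4
C         Same formula as the summary loop: bytes moved/time/1e6
          rate = n*words(j)*nbpw/mintime(j)/1.0d6
          avgbw = avgbw + rate
          PRINT 20,label(j),rate
   10 CONTINUE
C     Arithmetic mean of the four rates
      PRINT 20,'AvgBW: ',avgbw/4.0d0
   20 FORMAT (a,f10.4)
      END
```

Because the mean is computed after the timing loop has finished, adding it does not change any of the measured times.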
Script started on Tue Jun 13 09:54:27 2000
% head -1 /etc/motd
Compaq Tru64 UNIX T5.1-9 (Rev. 542); Thu Apr 20 10:47:49 EDT 2000
% /usr/sbin/psrinfo -v
Status of processor 0 as of: 06/13/00 09:54:34
Processor has been on-line since 05/09/2000 14:38:24
The alpha EV6.7 (21264A) processor operates at 667 MHz,
and has an alpha internal floating point processor.
Status of processor 1 as of: 06/13/00 09:54:34
Processor has been on-line since 05/09/2000 14:38:24
The alpha EV6.7 (21264A) processor operates at 667 MHz,
and has an alpha internal floating point processor.
Status of processor 2 as of: 06/13/00 09:54:34
Processor has been on-line since 05/09/2000 14:38:24
The alpha EV6.7 (21264A) processor operates at 667 MHz,
and has an alpha internal floating point processor.
Status of processor 3 as of: 06/13/00 09:54:34
Processor has been on-line since 05/09/2000 14:38:24
The alpha EV6.7 (21264A) processor operates at 667 MHz,
and has an alpha internal floating point processor.
% what /shlib/libpthread.so | grep DECth
DECthreads version V3.18-014 Apr 9 2000
% diff stream_d.f_as_at_ftp_site_22may97 mcc_omp.f
diff stream_d.f_as_at_ftp_site_22may97 mcc_omp.f
0a1,4
> * this version 25-May-2000 j. henning compaq
> * Arrays are ALLOCATABLE; OpenMP syntax has been simplified (per
> * suggestion from John McCalpin); timer is Fortran-90 SYSTEM_CLOCK
>
52,53c56,57
<       INTEGER n,offset,ndim,ntimes
<       PARAMETER (n=2000000,offset=0,ndim=n+offset,ntimes=10)
---
>       INTEGER*8 n,offset,ndim,ntimes,maxtimes
>       PARAMETER (maxtimes=10000)
57c61
<       INTEGER j,k,nbpw,quantum
---
>       INTEGER*8 j,k,nbpw,quantum
60,62c64,66
<       DOUBLE PRECISION maxtime(4),mintime(4),rmstime(4),sum(3),
<      $  times(4,ntimes)
<       INTEGER bytes(4)
---
>       DOUBLE PRECISION maxtime(4),mintime(4),rmstime(4),
>      $  sum1,sum2,sum3,times(4,maxtimes),avgbw
>       INTEGER*8 bytes(4)
75c79
<       DOUBLE PRECISION a(ndim),b(ndim),c(ndim)
---
>       REAL*8, ALLOCATABLE:: a(:),b(:),c(:)
88a93,98
>       PRINT *, "n, offset, ntimes"
>       READ *, n, offset, ntimes
>       ndim=n+offset
>       IF (ntimes .GT. maxtimes) ntimes=maxtimes
>       ALLOCATE (a(ndim), b(ndim), c(ndim))
>       CALL defend_wrap
94c104
<      $  3*nbpw*n/ (1024*1024),' MB'
---
>      $  3D0*nbpw*DBLE(n)/ (1024d0*1024),' MB'
97a108
> c$omp parallel do
103a115
> c$omp parallel do
128a141
> c$omp parallel do
135a149
> c$omp parallel do
142a157
> c$omp parallel do
149a165
> c$omp parallel do
165a182
>       avgbw = 0
169a187
>           avgbw = avgbw+n*bytes(j)*nbpw/mintime(j)/1.0D6
171,173c189,193
<       sum(1) = 0.0d0
<       sum(2) = 0.0d0
<       sum(3) = 0.0d0
---
>       WRITE (*,FMT=9050) "AvgBW: ", avgbw/4D0
>       sum1 = 0.0d0
>       sum2 = 0.0d0
>       sum3 = 0.0d0
> c$omp parallel do
175,177c195,197
<           sum(1) = sum(1) + a(j)
<           sum(2) = sum(2) + b(j)
<           sum(3) = sum(3) + c(j)
---
>           sum1 = sum1 + a(j)
>           sum2 = sum2 + b(j)
>           sum3 = sum3 + c(j)
179,181c199,201
<       PRINT *,'Sum of a is = ',sum(1)
<       PRINT *,'Sum of b is = ',sum(2)
<       PRINT *,'Sum of c is = ',sum(3)
---
>       PRINT *,'Sum of a is = ',sum1
>       PRINT *,'Sum of b is = ',sum2
>       PRINT *,'Sum of c is = ',sum3
185c205
<  9020 FORMAT (1x,a,i4,a)
---
>  9020 FORMAT (1x,a,f10.2,a)
334a355,374
>       END
>
>       SUBROUTINE defend_wrap
>       INTEGER count, count_rate, count_max
>       CALL SYSTEM_CLOCK ( count, count_rate, count_max )
>       IF (DBLE(count) .GT. .999*DBLE(count_max)) THEN
>           PRINT *,"Oops, this code won't handle a wrapping system_clock"
>           PRINT *,"and soon we will wrap."
>           PRINT 4, "count:", count, "count_max:", count_max
>     4     FORMAT (1X, A10, I16)
>           PRINT *,"Try again later, or fix the code to handle wraps."
>           PRINT *,"(The counter wraps approx once every 60 hours)"
>           STOP
>       END IF
>       END
>
>       DOUBLE PRECISION FUNCTION second
>       INTEGER count, count_rate, count_max
>       CALL SYSTEM_CLOCK ( count, count_rate, count_max )
>       second = DBLE(count)/DBLE(count_rate)
%
% cat !$
% cat mcc_omp.f
* this version 25-May-2000 j. henning compaq
* Arrays are ALLOCATABLE; OpenMP syntax has been simplified (per
* suggestion from John McCalpin); timer is Fortran-90 SYSTEM_CLOCK

* Program: Stream
* Programmer: John D. McCalpin
* Revision: 4.1, June 4, 1996
*
* This program measures memory transfer rates in MB/s for simple
* computational kernels coded in Fortran. These numbers reveal the
* quality of code generation for simple uncacheable kernels as well
* as showing the cost of floating-point operations relative to memory
* accesses.
*
*=========================================================================
* INSTRUCTIONS:
* 1) Stream requires a cpu timing function called second().
*    A sample is shown below. This is unfortunately rather
*    system dependent. The code attempts to determine the
*    granularity of the clock to help interpret the results.
*    For dedicated or parallel runs, you might want to comment
*    these out and compile/link with "wallclock.c".
* 2) Stream requires a good bit of memory to run.
*    Adjust the Parameter 'N' in the main program to give
*    a 'timing calibration' of at least 20 clicks.
*    This will provide rate estimates that should be good to
*    about 5% precision.
* ------------------------------------------------------------
* Note that you are free to use any array length and offset
* that makes each array larger than the last-level cache.
* The intent is to determine the *best* sustainable bandwidth
* available with this simple coding. Of course, lower values
* are usually fairly easy to obtain on cached machines, but
* by keeping the test to the *best* results, the answers are
* easier to interpret.
* You may put the arrays in common or not, at your discretion.
* There is a commented-out COMMON statement below.
* ------------------------------------------------------------
* 3) Compile the code with full optimization. Many compilers
*    generate unreasonably bad code before the optimizer tightens
*    things up. If the results are unreasonably good, on the
*    other hand, the optimizer might be too smart for me
*    Please let me know if this happens.
* 4) Mail the results to mccalpin@cs.virginia.edu
*    Be sure to include:
*    a) computer hardware model number and software revision
*    b) the compiler flags
*    c) all of the output from the test case.
*
* Thanks
*=========================================================================
*
      PROGRAM stream
* IMPLICIT NONE
C .. Parameters ..
      INTEGER*8 n,offset,ndim,ntimes,maxtimes
      PARAMETER (maxtimes=10000)
C ..
C .. Local Scalars ..
      DOUBLE PRECISION dummy,scalar,t
      INTEGER*8 j,k,nbpw,quantum
C ..
C .. Local Arrays ..
      DOUBLE PRECISION maxtime(4),mintime(4),rmstime(4),
     $  sum1,sum2,sum3,times(4,maxtimes),avgbw
      INTEGER*8 bytes(4)
      CHARACTER label(4)*11
C ..
C .. External Functions ..
      DOUBLE PRECISION second
      INTEGER checktick,realsize
      EXTERNAL second,checktick,realsize
C ..
C .. Intrinsic Functions ..
C INTRINSIC dble,max,min,nint,sqrt
C ..
C .. Arrays in Common ..
      REAL*8, ALLOCATABLE:: a(:),b(:),c(:)
C ..
C .. Common blocks ..
* COMMON a,b,c
C ..
C .. Data statements ..
      DATA rmstime/4*0.0D0/,mintime/4*1.0D+36/,maxtime/4*0.0D0/
      DATA label/'Copy: ','Scale: ','Add: ',
     $  'Triad: '/
      DATA bytes/2,2,3,3/,dummy/0.0d0/
C ..
* --- SETUP --- determine precision and check timing ---
      PRINT *, "n, offset, ntimes"
      READ *, n, offset, ntimes
      ndim=n+offset
      IF (ntimes .GT. maxtimes) ntimes=maxtimes
      ALLOCATE (a(ndim), b(ndim), c(ndim))
      CALL defend_wrap
      nbpw = realsize()

      WRITE (*,FMT=9010) 'Array size = ',n
      WRITE (*,FMT=9010) 'Offset = ',offset
      WRITE (*,FMT=9020) 'The total memory requirement is ',
     $  3D0*nbpw*DBLE(n)/ (1024d0*1024),' MB'
      WRITE (*,FMT=9030) 'You are running each test ',ntimes,' times'
      WRITE (*,FMT=9030) 'The *best* time for each test is used'

c$omp parallel do
      DO 10 j = 1,n
          a(j) = 1.0d0
          b(j) = 2.0D0
          c(j) = 0.0D0
   10 CONTINUE
      t = second(dummy)
c$omp parallel do
      DO 20 j = 1,n
          a(j) = 2.0d0*a(j)
   20 CONTINUE
      t = second(dummy) - t
      PRINT *,'----------------------------------------------------'
      quantum = checktick()
      WRITE (*,FMT=9000)
     $  'Your clock granularity/precision appears to be ',quantum,
     $  ' microseconds'
      PRINT *,'The tests below will each take a time on the order '
      PRINT *,'of ',nint(t*1d6),' microseconds'
      PRINT *,' (= ',nint((t*1d6)/quantum),' clock ticks)'
      PRINT *,'Increase the size of the arrays if this shows that'
      PRINT *,'you are not getting at least 20 clock ticks per test.'
      PRINT *,'----------------------------------------------------'
      PRINT *,'WARNING -- The above is only a rough guideline.'
      PRINT *,'For best results, please be sure you know the'
      PRINT *,'precision of your system timer.'
      PRINT *,'----------------------------------------------------'
* --- MAIN LOOP --- repeat test cases NTIMES times ---
      scalar = 1.5d0*a(1)
      DO 70 k = 1,ntimes

          t = second(dummy)
c$omp parallel do
          DO 30 j = 1,n
              c(j) = a(j)
   30     CONTINUE
          t = second(dummy) - t
          times(1,k) = t

          t = second(dummy)
c$omp parallel do
          DO 40 j = 1,n
              b(j) = scalar*c(j)
   40     CONTINUE
          t = second(dummy) - t
          times(2,k) = t

          t = second(dummy)
c$omp parallel do
          DO 50 j = 1,n
              c(j) = a(j) + b(j)
   50     CONTINUE
          t = second(dummy) - t
          times(3,k) = t

          t = second(dummy)
c$omp parallel do
          DO 60 j = 1,n
              a(j) = b(j) + scalar*c(j)
   60     CONTINUE
          t = second(dummy) - t
          times(4,k) = t
   70 CONTINUE
* --- SUMMARY ---
      DO 90 k = 1,ntimes
          DO 80 j = 1,4
              rmstime(j) = rmstime(j) + times(j,k)**2
              mintime(j) = min(mintime(j),times(j,k))
              maxtime(j) = max(maxtime(j),times(j,k))
   80     CONTINUE
   90 CONTINUE
      WRITE (*,FMT=9040)
      avgbw = 0
      DO 100 j = 1,4
          rmstime(j) = sqrt(rmstime(j)/dble(ntimes))
          WRITE (*,FMT=9050) label(j),n*bytes(j)*nbpw/mintime(j)/1.0D6,
     $      rmstime(j),mintime(j),maxtime(j)
          avgbw = avgbw+n*bytes(j)*nbpw/mintime(j)/1.0D6
  100 CONTINUE
      WRITE (*,FMT=9050) "AvgBW: ", avgbw/4D0
      sum1 = 0.0d0
      sum2 = 0.0d0
      sum3 = 0.0d0
c$omp parallel do
      DO 110 j = 1,n
          sum1 = sum1 + a(j)
          sum2 = sum2 + b(j)
          sum3 = sum3 + c(j)
  110 CONTINUE
      PRINT *,'Sum of a is = ',sum1
      PRINT *,'Sum of b is = ',sum2
      PRINT *,'Sum of c is = ',sum3

 9000 FORMAT (1x,a,i6,a)
 9010 FORMAT (1x,a,i10)
 9020 FORMAT (1x,a,f10.2,a)
 9030 FORMAT (1x,a,i3,a,a)
 9040 FORMAT ('Function',5x,'Rate (MB/s) RMS time Min time Max time'
     $  )
 9050 FORMAT (a,4 (f10.4,2x))
      END
*-------------------------------------
* INTEGER FUNCTION dblesize()
*
* A semi-portable way to determine the precision of DOUBLE PRECISION
* in Fortran.
* Here used to guess how many bytes of storage a DOUBLE PRECISION
* number occupies.
*
      INTEGER FUNCTION realsize()
* IMPLICIT NONE

C .. Local Scalars ..
      DOUBLE PRECISION result,test
      INTEGER j,ndigits
C ..
C .. Local Arrays ..
      DOUBLE PRECISION ref(30)
C ..
C .. External Subroutines ..
      EXTERNAL confuse
C ..
C .. Intrinsic Functions ..
      INTRINSIC abs,acos,log10,sqrt
C ..

C Test #1 - compare single(1.0d0+delta) to 1.0d0

   10 DO 20 j = 1,30
          ref(j) = 1.0d0 + 10.0d0** (-j)
   20 CONTINUE

      DO 30 j = 1,30
          test = ref(j)
          ndigits = j
          CALL confuse(test,result)
          IF (test.EQ.1.0D0) THEN
              GO TO 40
          END IF
   30 CONTINUE
      GO TO 50

   40 WRITE (*,FMT='(a)')
     $  '----------------------------------------------'
      WRITE (*,FMT='(1x,a,i2,a)') 'Double precision appears to have ',
     $  ndigits,' digits of accuracy'
      IF (ndigits.LE.8) THEN
          realsize = 4
      ELSE
          realsize = 8
      END IF
      WRITE (*,FMT='(1x,a,i1,a)') 'Assuming ',realsize,
     $  ' bytes per DOUBLE PRECISION word'
      WRITE (*,FMT='(a)')
     $  '----------------------------------------------'
      RETURN

   50 PRINT *,'Hmmmm. I am unable to determine the size.'
      PRINT *,'Please enter the number of Bytes per DOUBLE PRECISION',
     $  ' number : '
      READ (*,FMT=*) realsize
      IF (realsize.NE.4 .AND. realsize.NE.8) THEN
          PRINT *,'Your answer ',realsize,' does not make sense.'
          PRINT *,'Try again.'
          PRINT *,'Please enter the number of Bytes per ',
     $      'DOUBLE PRECISION number : '
          READ (*,FMT=*) realsize
      END IF
      PRINT *,'You have manually entered a size of ',realsize,
     $  ' bytes per DOUBLE PRECISION number'
      WRITE (*,FMT='(a)')
     $  '----------------------------------------------'
      END

      SUBROUTINE confuse(q,r)
* IMPLICIT NONE
C .. Scalar Arguments ..
      DOUBLE PRECISION q,r
C ..
C .. Intrinsic Functions ..
      INTRINSIC cos
C ..
      r = cos(q)
      RETURN
      END
* A semi-portable way to determine the clock granularity
* Adapted from a code by John Henning of Digital Equipment Corporation
*
      INTEGER FUNCTION checktick()
* IMPLICIT NONE

C .. Parameters ..
      INTEGER n
      PARAMETER (n=20)
C ..
C .. Local Scalars ..
      DOUBLE PRECISION dummy,t1,t2
      INTEGER i,j,jmin
C ..
C .. Local Arrays ..
      DOUBLE PRECISION timesfound(n)
C ..
C .. External Functions ..
      DOUBLE PRECISION second
      EXTERNAL second
C ..
C .. Intrinsic Functions ..
      INTRINSIC max,min,nint
C ..
      i = 0
      dummy = 0.0d0
      t1 = second(dummy)

   10 t2 = second(dummy)
      IF (t2.EQ.t1) GO TO 10

      t1 = t2
      i = i + 1
      timesfound(i) = t1
      IF (i.LT.n) GO TO 10

      jmin = 1000000
      DO 20 i = 2,n
          j = nint((timesfound(i)-timesfound(i-1))*1d6)
          jmin = min(jmin,max(j,0))
   20 CONTINUE

      IF (jmin.GT.0) THEN
          checktick = jmin
      ELSE
          PRINT *,'Your clock granularity appears to be less ',
     $      'than one microsecond'
          checktick = 1
      END IF
      RETURN

* PRINT 14, timesfound(1)*1d6
* DO 20 i=2,n
* PRINT 14, timesfound(i)*1d6,
* & nint((timesfound(i)-timesfound(i-1))*1d6)
* 14 FORMAT (1X, F18.4, 1X, i8)
* 20 CONTINUE

      END
      SUBROUTINE defend_wrap
      INTEGER count, count_rate, count_max
      CALL SYSTEM_CLOCK ( count, count_rate, count_max )
      IF (DBLE(count) .GT. .999*DBLE(count_max)) THEN
          PRINT *,"Oops, this code won't handle a wrapping system_clock"
          PRINT *,"and soon we will wrap."
          PRINT 4, "count:", count, "count_max:", count_max
    4     FORMAT (1X, A10, I16)
          PRINT *,"Try again later, or fix the code to handle wraps."
          PRINT *,"(The counter wraps approx once every 60 hours)"
          STOP
      END IF
      END

      DOUBLE PRECISION FUNCTION second
      INTEGER count, count_rate, count_max
      CALL SYSTEM_CLOCK ( count, count_rate, count_max )
      second = DBLE(count)/DBLE(count_rate)
      END
%
% cat buildit.csh
#!/bin/csh
set verbose
unlimit
f90 -v -omp -source_listing -machine_code \
    -o mcc_omp_`date +%Y%m%d` \
    -fast -O5 -unroll 32 -arch ev6 \
    mcc_omp.f
grep COMPILER: mcc_omp.lis
%
% ./buildit.csh
unlimit
f90 -v -omp -source_listing -machine_code -o mcc_omp_`date +%Y%m%d` -fast -O5 -unroll 32 -arch ev6 mcc_omp.f
/usr/lib/cmplrs/fort90/decfort90 -machine_code -fast -O5 -unroll 32 -arch ev6 -I/usr/lib/cmplrs/hpfrtl -omp -reentrancy threaded -automatic -source_listing -o /tmp/forAAAaeefga.o mcc_omp.f
/usr/bin/cc -v -o mcc_omp_20000613 -arch ev6 /usr/lib/cmplrs/fort90/for_main.o -source_listing /tmp/forAAAaeefga.o -O4 -pthread -qlshpf -lUfor -lfor -lFutil -lm -lots3 -lots -lm_c32
/usr/lib/cmplrs/cc.dtk/ld -o mcc_omp_20000613 -g0 -O4 -call_shared /usr/lib/cmplrs/cc.dtk/crt0.o /usr/lib/cmplrs/fort90/for_main.o /tmp/forAAAaeefga.o -qlshpf_r -qlshpf -qlUfor_r -lUfor -qlfor_r -lfor -qlFutil_r -lFutil -qlm_r -lm -qlots3_r -lots3 -qlots_r -lots -qlm_c32_r -lm_c32 -lpthread -lexc -lc
/usr/lib/cmplrs/cc.dtk/ld:
0.01u 0.02s 0:00 4% 0+16k 91+23io 0pf+0w 16stk+2272mem
grep COMPILER: mcc_omp.lis
COMPILER: Compaq Fortran V5.3-915-449BB
% ls
buildit.csh   mcc_omp_20000613
mcc_omp.f     stream_d.f_as_at_ftp_site_22may97
mcc_omp.lis   typescript
% ./mcc_omp_20000613
n, offset, ntimes
8005626,0,10\
forrtl: severe (59): list-directed I/O syntax error, unit -4, file /dev/pts/3
% ./mcc_omp_20000613
n, offset, ntimes
8005626,0,10
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
Array size = 8005626
Offset = 0
The total memory requirement is 183.23 MB
You are running each test 10 times
The *best* time for each test is used
----------------------------------------------------
Your clock granularity/precision appears to be 100 microseconds
The tests below will each take a time on the order
of 40600 microseconds
(= 406 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
----------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
----------------------------------------------------
Function   Rate (MB/s)   RMS time   Min time   Max time
Copy:        2472.7802     0.0632     0.0518     0.1165
Scale:       2363.2844     0.0552     0.0542     0.0570
Add:         2314.8798     0.0841     0.0830     0.0895
Triad:       2369.1125     0.0818     0.0811     0.0856
AvgBW:       2380.0142
Sum of a is = 2.308224256790819E+018
Sum of b is = 4.616448513254510E+017
Sum of c is = 6.155264684672618E+017
% exit
%
script done on Tue Jun 13 09:56:07 2000
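A note on the three "Sum of" lines at the end of the run. With scalar = 3, each pass through the four kernels multiplies a by 15, so after 10 passes every element of a should be 2*15**10, and the sum over 8005626 elements would be about 9.23E+18. The printed 2.308E+18 corresponds to 2001407 elements, roughly one quarter of the array, and the b and c sums show the same ratio. That pattern is what one might expect if the four threads accumulated into the shared sum1, sum2 and sum3 without a REDUCTION clause; this is an inference from the numbers, not something the message states, and the timed kernels are not affected either way. A minimal sketch of such a loop with a reduction follows. The program name `sumchk`, the array size and the fill value are placeholders for illustration, not taken from mcc_omp.f.

```fortran
* Sketch only: a checksum loop with an OpenMP REDUCTION clause.
* The array size and fill value are placeholders for illustration.
      PROGRAM sumchk
      IMPLICIT NONE
      INTEGER n,j
      PARAMETER (n=1000)
      DOUBLE PRECISION a(n),sum1
      DO 10 j = 1,n
          a(j) = 2.0d0
   10 CONTINUE
      sum1 = 0.0d0
C     Each thread sums into a private copy of sum1; the copies are
C     combined when the loop ends.  Without OpenMP enabled the
C     directive is an ordinary comment and the loop runs serially.
c$omp parallel do reduction(+:sum1)
      DO 20 j = 1,n
          sum1 = sum1 + a(j)
   20 CONTINUE
      PRINT *,'Sum of a is = ',sum1
      END
```

With 1000 elements of 2.0 the sum is 2000 for any thread count, which makes the sketch easy to check. In mcc_omp.f the same idea would presumably list all three accumulators in one clause, e.g. reduction(+:sum1,sum2,sum3).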