4cpu Compaq_AlphaServer_ES40_667

From: John Henning (henning@perfom.zko.dec.com)
Date: Fri Jun 16 2000 - 17:05:05 CDT

Next message: John Henning: "1cpu Compaq_AlphaServer_GS320 and Compaq_AlphaServer_GS160"

Previous message: John Henning: "2cpu Compaq_AlphaServer_ES40_667"

henning@perfom.zko.dec.com (John Henning) on 06/16/2000 04:53:01 PM

To: John D Mccalpin/Austin/IBM@IBMUS
Subject: 4cpu Compaq_AlphaServer_ES40_667

Hi John,
   This is a 4-CPU AlphaServer ES40 6/667 (if that's too big a name,
   you could call it Compaq_AlphaServer_ES40_667 or even [if pressed]
   Compaq_Alpha_ES40_667)

You can see that I simplified the OpenMP syntax, per your suggestion.
I also added an arithmetic mean of the bandwidths of the 4 loops
(printed as AvgBW); it was useful while experimenting, and it doesn't
affect the timed portion.

Thanks
- John
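
The rates and the arithmetic mean can be re-derived from the minimum times in the run recorded further down. A minimal Python sketch, assuming the printed minimum times are exact to four decimals (plausible given the reported 100-microsecond clock granularity); the names `rate_mb_s`, `WORDS` and `MIN_TIME` are illustrative and not part of the benchmark:

```python
# Values copied from the run in this message.
N = 8005626        # array length entered at the prompt
NBPW = 8           # bytes per DOUBLE PRECISION word, as detected by the program
WORDS = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}   # the bytes(4) DATA values
MIN_TIME = {"Copy": 0.0518, "Scale": 0.0542, "Add": 0.0830, "Triad": 0.0811}


def rate_mb_s(kernel):
    """Bandwidth in MB/s (1 MB = 1e6 bytes), same formula as the summary loop."""
    return N * WORDS[kernel] * NBPW / MIN_TIME[kernel] / 1.0e6


rates = {kernel: rate_mb_s(kernel) for kernel in WORDS}

# Arithmetic mean of the four kernel rates, printed by the program as AvgBW.
avg_bw = sum(rates.values()) / len(rates)

# Total memory for the three arrays, in units of 1024*1024 bytes.
total_mb = 3 * NBPW * N / (1024 * 1024)

for kernel, rate in rates.items():
    print(f"{kernel:<7s}{rate:12.4f}")
print(f"AvgBW  {avg_bw:12.4f}")
print(f"Memory {total_mb:12.2f} MB")
```

Note that the two "MB" figures use different units: the rates divide by 1.0D6, while the memory requirement divides by 1024*1024.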

Script started on Tue Jun 13 09:54:27 2000
% head -1 /etc/motd
Compaq Tru64 UNIX T5.1-9 (Rev. 542); Thu Apr 20 10:47:49 EDT 2000
% /usr/sbin/psrinfo -v
Status of processor 0 as of: 06/13/00 09:54:34
  Processor has been on-line since 05/09/2000 14:38:24
  The alpha EV6.7 (21264A) processor operates at 667 MHz,
     and has an alpha internal floating point processor.
Status of processor 1 as of: 06/13/00 09:54:34
  Processor has been on-line since 05/09/2000 14:38:24
  The alpha EV6.7 (21264A) processor operates at 667 MHz,
     and has an alpha internal floating point processor.
Status of processor 2 as of: 06/13/00 09:54:34
  Processor has been on-line since 05/09/2000 14:38:24
  The alpha EV6.7 (21264A) processor operates at 667 MHz,
     and has an alpha internal floating point processor.
Status of processor 3 as of: 06/13/00 09:54:34
  Processor has been on-line since 05/09/2000 14:38:24
  The alpha EV6.7 (21264A) processor operates at 667 MHz,
     and has an alpha internal floating point processor.
% what /shlib/libpthread.so | grep DECth
      DECthreads version V3.18-014 Apr 9 2000
% diff stream_d.f_as_at_ftp_site_22may97 mcc_omp.f
diff stream_d.f_as_at_ftp_site_22may97 mcc_omp.f
0a1,4
> * this version 25-May-2000 j. henning compaq
> * Arrays are ALLOCATABLE; OpenMP syntax has been simplified (per
> * suggestion from John McCalpin); timer is Fortran-90 SYSTEM_CLOCK
>
52,53c56,57
<       INTEGER n,offset,ndim,ntimes
<       PARAMETER (n=2000000,offset=0,ndim=n+offset,ntimes=10)
---
>       INTEGER*8 n,offset,ndim,ntimes,maxtimes
>       PARAMETER (maxtimes=10000)
57c61
<       INTEGER j,k,nbpw,quantum
---
>       INTEGER*8 j,k,nbpw,quantum
60,62c64,66
<       DOUBLE PRECISION maxtime(4),mintime(4),rmstime(4),sum(3),
<      $                 times(4,ntimes)
<       INTEGER bytes(4)
---
>       DOUBLE PRECISION maxtime(4),mintime(4),rmstime(4),
>      $                 sum1,sum2,sum3,times(4,maxtimes),avgbw
>       INTEGER*8 bytes(4)
75c79
<       DOUBLE PRECISION a(ndim),b(ndim),c(ndim)
---
>       REAL*8, ALLOCATABLE:: a(:),b(:),c(:)
88a93,98
>       PRINT *, "n, offset, ntimes"
>       READ *, n, offset, ntimes
>       ndim=n+offset
>       IF (ntimes .GT. maxtimes) ntimes=maxtimes
>       ALLOCATE (a(ndim), b(ndim), c(ndim))
>       CALL defend_wrap
94c104
<      $  3*nbpw*n/ (1024*1024),' MB'
---
>      $  3D0*nbpw*DBLE(n)/ (1024d0*1024),' MB'
97a108
> c$omp parallel do
103a115
> c$omp parallel do
128a141
> c$omp parallel do
135a149
> c$omp parallel do
142a157
> c$omp parallel do
149a165
> c$omp parallel do
165a182
>       avgbw = 0
169a187
>           avgbw = avgbw+n*bytes(j)*nbpw/mintime(j)/1.0D6
171,173c189,193
<       sum(1) = 0.0d0
<       sum(2) = 0.0d0
<       sum(3) = 0.0d0
---
>       WRITE (*,FMT=9050) "AvgBW:     ", avgbw/4D0
>       sum1 = 0.0d0
>       sum2 = 0.0d0
>       sum3 = 0.0d0
> c$omp parallel do
175,177c195,197
<           sum(1) = sum(1) + a(j)
<           sum(2) = sum(2) + b(j)
<           sum(3) = sum(3) + c(j)
---
>           sum1 = sum1 + a(j)
>           sum2 = sum2 + b(j)
>           sum3 = sum3 + c(j)
179,181c199,201
<       PRINT *,'Sum of a is = ',sum(1)
<       PRINT *,'Sum of b is = ',sum(2)
<       PRINT *,'Sum of c is = ',sum(3)
---
>       PRINT *,'Sum of a is = ',sum1
>       PRINT *,'Sum of b is = ',sum2
>       PRINT *,'Sum of c is = ',sum3
185c205
<  9020 FORMAT (1x,a,i4,a)
---
>  9020 FORMAT (1x,a,f10.2,a)
334a355,374
>       END
>
>       SUBROUTINE defend_wrap
>       INTEGER count, count_rate, count_max
>       CALL SYSTEM_CLOCK ( count, count_rate, count_max )
>       IF (DBLE(count) .GT. .999*DBLE(count_max)) THEN
>          PRINT *,"Oops, this code won't handle a wrapping system_clock"
>          PRINT *,"and soon we will wrap."
>          PRINT 4, "count:", count, "count_max:", count_max
>   4      FORMAT (1X, A10, I16)
>          PRINT *,"Try again later, or fix the code to handle wraps."
>          PRINT *,"(The counter wraps approx once every 60 hours)"
>          STOP
>       END IF
>       END
>
>       DOUBLE PRECISION FUNCTION second
>       INTEGER count, count_rate, count_max
>       CALL SYSTEM_CLOCK ( count, count_rate, count_max )
>       second = DBLE(count)/DBLE(count_rate)
%
% cat !$
% cat mcc_omp.f
* this version 25-May-2000 j. henning compaq
* Arrays are ALLOCATABLE; OpenMP syntax has been simplified (per
* suggestion from John McCalpin); timer is Fortran-90 SYSTEM_CLOCK
* Program: Stream
* Programmer: John D. McCalpin
* Revision: 4.1, June 4, 1996
*
* This program measures memory transfer rates in MB/s for simple
* computational kernels coded in Fortran.  These numbers reveal the
* quality of code generation for simple uncacheable kernels as well
* as showing the cost of floating-point operations relative to memory
* accesses.
*
*=========================================================================
* INSTRUCTIONS:
*       1) Stream requires a cpu timing function called second().
*          A sample is shown below.  This is unfortunately rather
*          system dependent.  The code attempts to determine the
*          granularity of the clock to help interpret the results.
*          For dedicated or parallel runs, you might want to comment
*          these out and compile/link with "wallclock.c".
*       2) Stream requires a good bit of memory to run.
*          Adjust the Parameter 'N' in the main program to give
*          a 'timing calibration' of at least 20 clicks.
*          This will provide rate estimates that should be good to
*          about 5% precision.
*          ------------------------------------------------------------
*          Note that you are free to use any array length and offset
*          that makes each array larger than the last-level cache.
*          The intent is to determine the *best* sustainable bandwidth
*          available with this simple coding.  Of course, lower values
*          are usually fairly easy to obtain on cached machines, but
*          by keeping the test to the *best* results, the answers are
*          easier to interpret.
*          You may put the arrays in common or not, at your discretion.
*          There is a commented-out COMMON statement below.
*          ------------------------------------------------------------
*       3) Compile the code with full optimization.  Many compilers
*          generate unreasonably bad code before the optimizer tightens
*          things up.  If the results are unreasonably good, on the
*          other hand, the optimizer might be too smart for me.
*          Please let me know if this happens.
*       4) Mail the results to mccalpin@cs.virginia.edu
*          Be sure to include:
*               a) computer hardware model number and software revision
*               b) the compiler flags
*               c) all of the output from the test case.
*
* Thanks
*=========================================================================
*
      PROGRAM stream
*     IMPLICIT NONE
C     .. Parameters ..
      INTEGER*8 n,offset,ndim,ntimes,maxtimes
      PARAMETER (maxtimes=10000)
C     ..
C     .. Local Scalars ..
      DOUBLE PRECISION dummy,scalar,t
      INTEGER*8 j,k,nbpw,quantum
C     ..
C     .. Local Arrays ..
      DOUBLE PRECISION maxtime(4),mintime(4),rmstime(4),
     $                 sum1,sum2,sum3,times(4,maxtimes),avgbw
      INTEGER*8 bytes(4)
      CHARACTER label(4)*11
C     ..
C     .. External Functions ..
      DOUBLE PRECISION second
      INTEGER checktick,realsize
      EXTERNAL second,checktick,realsize
C     ..
C     .. Intrinsic Functions ..
C
      INTRINSIC dble,max,min,nint,sqrt
C     ..
C     .. Arrays in Common ..
      REAL*8, ALLOCATABLE:: a(:),b(:),c(:)
C     ..
C     .. Common blocks ..
*     COMMON a,b,c
C     ..
C     .. Data statements ..
      DATA rmstime/4*0.0D0/,mintime/4*1.0D+36/,maxtime/4*0.0D0/
      DATA label/'Copy:      ','Scale:     ','Add:       ',
     $     'Triad:     '/
      DATA bytes/2,2,3,3/,dummy/0.0d0/
C     ..
*       --- SETUP --- determine precision and check timing ---
      PRINT *, "n, offset, ntimes"
      READ *, n, offset, ntimes
      ndim=n+offset
      IF (ntimes .GT. maxtimes) ntimes=maxtimes
      ALLOCATE (a(ndim), b(ndim), c(ndim))
      CALL defend_wrap
      nbpw = realsize()
      WRITE (*,FMT=9010) 'Array size = ',n
      WRITE (*,FMT=9010) 'Offset     = ',offset
      WRITE (*,FMT=9020) 'The total memory requirement is ',
     $  3D0*nbpw*DBLE(n)/ (1024d0*1024),' MB'
      WRITE (*,FMT=9030) 'You are running each test ',ntimes,' times'
      WRITE (*,FMT=9030) 'The *best* time for each test is used'
c$omp parallel do
      DO 10 j = 1,n
          a(j) = 1.0d0
          b(j) = 2.0D0
          c(j) = 0.0D0
   10 CONTINUE
      t = second(dummy)
c$omp parallel do
      DO 20 j = 1,n
          a(j) = 2.0d0*a(j)
   20 CONTINUE
      t = second(dummy) - t
      PRINT *,'----------------------------------------------------'
      quantum = checktick()
      WRITE (*,FMT=9000)
     $  'Your clock granularity/precision appears to be ',quantum,
     $  ' microseconds'
      PRINT *,'The tests below will each take a time on the order '
      PRINT *,'of ',nint(t*1d6),' microseconds'
      PRINT *,'   (= ',nint((t*1d6)/quantum),' clock ticks)'
      PRINT *,'Increase the size of the arrays if this shows that'
      PRINT *,'you are not getting at least 20 clock ticks per test.'
      PRINT *,'----------------------------------------------------'
      PRINT *,'WARNING -- The above is only a rough guideline.'
      PRINT *,'For best results, please be sure you know the'
      PRINT *,'precision of your system timer.'
      PRINT *,'----------------------------------------------------'
*       --- MAIN LOOP --- repeat test cases NTIMES times ---
      scalar = 1.5d0*a(1)
      DO 70 k = 1,ntimes
          t = second(dummy)
c$omp parallel do
          DO 30 j = 1,n
              c(j) = a(j)
   30     CONTINUE
          t = second(dummy) - t
          times(1,k) = t
          t = second(dummy)
c$omp parallel do
          DO 40 j = 1,n
              b(j) = scalar*c(j)
   40     CONTINUE
          t = second(dummy) - t
          times(2,k) = t
          t = second(dummy)
c$omp parallel do
          DO 50 j = 1,n
              c(j) = a(j) + b(j)
   50     CONTINUE
          t = second(dummy) - t
          times(3,k) = t
          t = second(dummy)
c$omp parallel do
          DO 60 j = 1,n
              a(j) = b(j) + scalar*c(j)
   60     CONTINUE
          t = second(dummy) - t
          times(4,k) = t
   70 CONTINUE
*       --- SUMMARY ---
      DO 90 k = 1,ntimes
          DO 80 j = 1,4
              rmstime(j) = rmstime(j) + times(j,k)**2
              mintime(j) = min(mintime(j),times(j,k))
              maxtime(j) = max(maxtime(j),times(j,k))
   80     CONTINUE
   90 CONTINUE
      WRITE (*,FMT=9040)
      avgbw = 0
      DO 100 j = 1,4
          rmstime(j) = sqrt(rmstime(j)/dble(ntimes))
          WRITE (*,FMT=9050) label(j),n*bytes(j)*nbpw/mintime(j)/1.0D6,
     $      rmstime(j),mintime(j),maxtime(j)
          avgbw = avgbw+n*bytes(j)*nbpw/mintime(j)/1.0D6
  100 CONTINUE
      WRITE (*,FMT=9050) "AvgBW:     ", avgbw/4D0
      sum1 = 0.0d0
      sum2 = 0.0d0
      sum3 = 0.0d0
c$omp parallel do
      DO 110 j = 1,n
          sum1 = sum1 + a(j)
          sum2 = sum2 + b(j)
          sum3 = sum3 + c(j)
  110 CONTINUE
      PRINT *,'Sum of a is = ',sum1
      PRINT *,'Sum of b is = ',sum2
      PRINT *,'Sum of c is = ',sum3
 9000 FORMAT (1x,a,i6,a)
 9010 FORMAT (1x,a,i10)
 9020 FORMAT (1x,a,f10.2,a)
 9030 FORMAT (1x,a,i3,a,a)
 9040 FORMAT ('Function',5x,'Rate (MB/s)  RMS time   Min time  Max time'
     $       )
 9050 FORMAT (a,4 (f10.4,2x))
      END
*-------------------------------------
* INTEGER FUNCTION realsize()
*
* A semi-portable way to determine the precision of DOUBLE PRECISION
* in Fortran.
* Here used to guess how many bytes of storage a DOUBLE PRECISION
* number occupies.
*
      INTEGER FUNCTION realsize()
*     IMPLICIT NONE
C     .. Local Scalars ..
      DOUBLE PRECISION result,test
      INTEGER j,ndigits
C     ..
C     .. Local Arrays ..
      DOUBLE PRECISION ref(30)
C     ..
C     .. External Subroutines ..
      EXTERNAL confuse
C     ..
C     .. Intrinsic Functions ..
      INTRINSIC abs,acos,log10,sqrt
C     ..
C       Test #1 - compare single(1.0d0+delta) to 1.0d0
   10 DO 20 j = 1,30
          ref(j) = 1.0d0 + 10.0d0** (-j)
   20 CONTINUE
      DO 30 j = 1,30
          test = ref(j)
          ndigits = j
          CALL confuse(test,result)
          IF (test.EQ.1.0D0) THEN
              GO TO 40
          END IF
   30 CONTINUE
      GO TO 50
   40 WRITE (*,FMT='(a)')
     $  '----------------------------------------------'
      WRITE (*,FMT='(1x,a,i2,a)') 'Double precision appears to have ',
     $  ndigits,' digits of accuracy'
      IF (ndigits.LE.8) THEN
          realsize = 4
      ELSE
          realsize = 8
      END IF
      WRITE (*,FMT='(1x,a,i1,a)') 'Assuming ',realsize,
     $  ' bytes per DOUBLE PRECISION word'
      WRITE (*,FMT='(a)')
     $  '----------------------------------------------'
      RETURN
   50 PRINT *,'Hmmmm.  I am unable to determine the size.'
      PRINT *,'Please enter the number of Bytes per DOUBLE PRECISION',
     $  ' number : '
      READ (*,FMT=*) realsize
      IF (realsize.NE.4 .AND. realsize.NE.8) THEN
          PRINT *,'Your answer ',realsize,' does not make sense.'
          PRINT *,'Try again.'
          PRINT *,'Please enter the number of Bytes per ',
     $      'DOUBLE PRECISION number : '
          READ (*,FMT=*) realsize
      END IF
      PRINT *,'You have manually entered a size of ',realsize,
     $  ' bytes per DOUBLE PRECISION number'
      WRITE (*,FMT='(a)')
     $  '----------------------------------------------'
      END
      SUBROUTINE confuse(q,r)
*     IMPLICIT NONE
C     .. Scalar Arguments ..
      DOUBLE PRECISION q,r
C     ..
C     .. Intrinsic Functions ..
      INTRINSIC cos
C     ..
      r = cos(q)
      RETURN
      END
* A semi-portable way to determine the clock granularity
* Adapted from a code by John Henning of Digital Equipment Corporation
*
      INTEGER FUNCTION checktick()
*     IMPLICIT NONE
C     .. Parameters ..
      INTEGER n
      PARAMETER (n=20)
C     ..
C     .. Local Scalars ..
      DOUBLE PRECISION dummy,t1,t2
      INTEGER i,j,jmin
C     ..
C     .. Local Arrays ..
      DOUBLE PRECISION timesfound(n)
C     ..
C     .. External Functions ..
      DOUBLE PRECISION second
      EXTERNAL second
C     ..
C     .. Intrinsic Functions ..
      INTRINSIC max,min,nint
C     ..
      i = 0
      dummy = 0.0d0
      t1 = second(dummy)
   10 t2 = second(dummy)
      IF (t2.EQ.t1) GO TO 10
      t1 = t2
      i = i + 1
      timesfound(i) = t1
      IF (i.LT.n) GO TO 10
      jmin = 1000000
      DO 20 i = 2,n
          j = nint((timesfound(i)-timesfound(i-1))*1d6)
          jmin = min(jmin,max(j,0))
   20 CONTINUE
      IF (jmin.GT.0) THEN
          checktick = jmin
      ELSE
          PRINT *,'Your clock granularity appears to be less ',
     $      'than one microsecond'
          checktick = 1
      END IF
      RETURN
*      PRINT 14, timesfound(1)*1d6
*      DO 20 i=2,n
*         PRINT 14, timesfound(i)*1d6,
*     &       nint((timesfound(i)-timesfound(i-1))*1d6)
*   14    FORMAT (1X, F18.4, 1X, i8)
*   20 CONTINUE
      END
      SUBROUTINE defend_wrap
      INTEGER count, count_rate, count_max
      CALL SYSTEM_CLOCK ( count, count_rate, count_max )
      IF (DBLE(count) .GT. .999*DBLE(count_max)) THEN
         PRINT *,"Oops, this code won't handle a wrapping system_clock"
         PRINT *,"and soon we will wrap."
         PRINT 4, "count:", count, "count_max:", count_max
  4      FORMAT (1X, A10, I16)
         PRINT *,"Try again later, or fix the code to handle wraps."
         PRINT *,"(The counter wraps approx once every 60 hours)"
         STOP
      END IF
      END
      DOUBLE PRECISION FUNCTION second
      INTEGER count, count_rate, count_max
      CALL SYSTEM_CLOCK ( count, count_rate, count_max )
      second = DBLE(count)/DBLE(count_rate)
      END
%
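
The "approx once every 60 hours" figure in defend_wrap can be checked with a little arithmetic. The sketch below is in Python and rests on two assumptions that the listing does not state outright: that SYSTEM_CLOCK uses a default 32-bit INTEGER (so count_max is 2**31 - 1) and that count_rate is 10000 ticks per second, which would match the 100-microsecond granularity reported later in this run. The helper `elapsed_seconds` is hypothetical; it shows one way the timer could tolerate a single wrap instead of refusing to run.

```python
COUNT_MAX = 2**31 - 1   # assumption: default 32-bit INTEGER counter
COUNT_RATE = 10_000     # assumption: 10000 ticks/s, i.e. 100 microseconds per tick


def wrap_period_hours(count_max=COUNT_MAX, count_rate=COUNT_RATE):
    """Hours between wraps of a counter that resets to 0 after count_max."""
    return (count_max + 1) / count_rate / 3600.0


def guard_window_seconds(count_max=COUNT_MAX, count_rate=COUNT_RATE):
    """Seconds before the wrap during which defend_wrap (0.999 threshold) stops."""
    return 0.001 * count_max / count_rate


def elapsed_seconds(start, end, count_max=COUNT_MAX, count_rate=COUNT_RATE):
    """Elapsed time between two counter readings, tolerating at most one wrap."""
    ticks = (end - start) % (count_max + 1)
    return ticks / count_rate


print(f"wrap period : {wrap_period_hours():.2f} hours")
print(f"guard window: {guard_window_seconds():.1f} seconds")
print(f"across wrap : {elapsed_seconds(COUNT_MAX - 4, 5):.4f} seconds")
```

Under these assumptions the period comes out just under 60 hours, and the guard only rejects runs started in the last few minutes before a wrap.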
% cat buildit.csh
#!/bin/csh
set verbose
unlimit
f90 -v -omp -source_listing -machine_code \
  -o mcc_omp_`date +%Y%m%d` \
  -fast -O5 -unroll 32 -arch ev6 \
  mcc_omp.f
grep COMPILER: mcc_omp.lis
%
% ./buildit.csh
unlimit
f90 -v -omp -source_listing -machine_code -o mcc_omp_`date +%Y%m%d` -fast
-O5 -unroll 32 -arch ev6 mcc_omp.f
/usr/lib/cmplrs/fort90/decfort90 -machine_code -fast -O5 -unroll 32 -arch
ev6 -I/usr/lib/cmplrs/hpfrtl -omp -reentrancy threaded -automatic
-source_listing -o /tmp/forAAAaeefga.o mcc_omp.f
/usr/bin/cc -v -o mcc_omp_20000613 -arch ev6
/usr/lib/cmplrs/fort90/for_main.o -source_listing /tmp/forAAAaeefga.o -O4
-pthread -qlshpf -lUfor -lfor -lFutil -lm -lots3 -lots -lm_c32
/usr/lib/cmplrs/cc.dtk/ld -o mcc_omp_20000613 -g0 -O4 -call_shared
/usr/lib/cmplrs/cc.dtk/crt0.o /usr/lib/cmplrs/fort90/for_main.o
/tmp/forAAAaeefga.o -qlshpf_r -qlshpf -qlUfor_r -lUfor -qlfor_r -lfor
-qlFutil_r -lFutil -qlm_r -lm -qlots3_r -lots3 -qlots_r -lots -qlm_c32_r
-lm_c32 -lpthread -lexc -lc
/usr/lib/cmplrs/cc.dtk/ld:
0.01u 0.02s 0:00 4% 0+16k 91+23io 0pf+0w 16stk+2272mem
grep COMPILER: mcc_omp.lis
COMPILER: Compaq Fortran V5.3-915-449BB
% ls
buildit.csh                        mcc_omp_20000613
mcc_omp.f                          stream_d.f_as_at_ftp_site_22may97
mcc_omp.lis                        typescript
% ./mcc_omp_20000613
 n, offset, ntimes
8005626,0,10\
forrtl: severe (59): list-directed I/O syntax error, unit -4, file /dev/pts/3
% ./mcc_omp_20000613
 n, offset, ntimes
8005626,0,10
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size =    8005626
 Offset     =          0
 The total memory requirement is     183.23 MB
 You are running each test  10 times
 The *best* time for each test is used
 ----------------------------------------------------
 Your clock granularity/precision appears to be    100 microseconds
 The tests below will each take a time on the order
 of        40600  microseconds
    (=          406  clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
 ----------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
 ----------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Copy:       2472.7802      0.0632      0.0518      0.1165
Scale:      2363.2844      0.0552      0.0542      0.0570
Add:        2314.8798      0.0841      0.0830      0.0895
Triad:      2369.1125      0.0818      0.0811      0.0856
AvgBW:      2380.0142
 Sum of a is =   2.308224256790819E+018
 Sum of b is =   4.616448513254510E+017
 Sum of c is =   6.155264684672618E+017
% exit
%
script done on Tue Jun 13 09:56:07 2000
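
The three checksums printed near the end of the run can be compared with what the kernels should produce. Every array element starts out identical, so replaying the loops on a single element gives the expected sums. This is a Python sketch; the thread count of 4 and the static, near-equal split of the summation loop are assumptions, not something the transcript states.

```python
import math

N = 8005626      # array length entered at the prompt
NTIMES = 10      # repetitions entered at the prompt
THREADS = 4      # assumption: one thread per CPU

# Checksums as printed by the run.
REPORTED = {"a": 2.308224256790819e18,
            "b": 4.616448513254510e17,
            "c": 6.155264684672618e17}


def final_values(ntimes):
    """Replay the initialisation and the four kernels on one array element."""
    a, b, c = 1.0, 2.0, 0.0
    a = 2.0 * a                  # untimed loop 20
    scalar = 1.5 * a
    for _ in range(ntimes):
        c = a                    # Copy
        b = scalar * c           # Scale
        c = a + b                # Add
        a = b + scalar * c       # Triad
    return {"a": a, "b": b, "c": c}


values = final_values(NTIMES)
chunk = math.ceil(N / THREADS)   # largest share under an assumed static split

for name, value in values.items():
    full = N * value             # sum over the whole array
    share = chunk * value        # sum over one thread's share
    print(f"{name}: reported/full = {REPORTED[name] / full:.8f}, "
          f"reported/share = {REPORTED[name] / share:.12f}")
```

Under those assumptions the reported sums are about a quarter of the full-array sums and match one thread's share (2001407 elements) to within rounding. That would be consistent with the `c$omp parallel do` on the summation loop lacking a `reduction(+:sum1,sum2,sum3)` clause, though the transcript alone does not confirm it. The timed kernels are unaffected either way.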