This web page contains the slides and commentary from the keynote
address at the Third Annual IEEE Workshop on Workload Characterization, held
in Austin, TX, on September 16, 2000.
Commentary not in the original slides is added in green.
An Industry Perspective on Performance Characterization:
Applications vs Benchmarks
John D. McCalpin, Ph.D.
Senior Scientist
IBM
September 16, 2000
An Industry Perspective on Performance Characterization:
Applications vs Benchmarks
Applications are the reasons why people buy computers.
Benchmarks should be standardized, concise distillations
of the "important" features of those applications.
Micro-Benchmarks attempt to single out
one feature at a time (e.g., STREAM; see the sketch below).
Application Benchmarks attempt to cluster features
in a "representative" manner (e.g., SPEC).
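As a concrete illustration of the micro-benchmark category, the heart of the
STREAM TRIAD kernel is just the loop in the C sketch below. This is a minimal
sketch only: the real benchmark adds timing, repetition, and rules for sizing
the arrays well beyond the caches.

    #include <stdio.h>

    #define N 2000000          /* array length; real STREAM sizes this to exceed cache */

    static double a[N], b[N], c[N];

    int main(void)
    {
        const double scalar = 3.0;

        /* TRIAD: one multiply, one add, two loads, and one store per element */
        for (long j = 0; j < N; j++)
            a[j] = b[j] + scalar * c[j];

        printf("a[0] = %f\n", a[0]);   /* keep the result live */
        return 0;
    }

Each iteration generates at least 24 bytes of array traffic for only two
floating-point operations, so the loop isolates sustainable memory bandwidth.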
Audiences
Users of Benchmarks include:
people who buy computers
people who sell computers
people who design computers
Purchasers of computers want the benchmarks to tell
them about performance on their set of target applications.
Sellers of computers want the benchmarks to tell
enough about performance to get the purchaser's attention.
Designers of computers want the benchmarks to represent
the "important" details of the applications so that they can be used as
"concise distillations" in the design process.
Performance Characterization is the means by which
those distillations should be judged.
One might expect that the assemblers of industry-standard
benchmark suites would already have made quantitative evaluations
of the benchmarks against important applications. The following quote
gives an indication of how the SPEC CPU committee has historically
seen its role in this respect:
Example: SPEC CPU Benchmarks
"The goal of the SPEC CPU benchmarks is to provide
a standardized performance metric that is better than Dhrystone or MHz."
(an anonymous, but important, person
in the SPEC organization)
But I certainly don't want to single SPEC out
for ill-treatment. Here is some data I collected in 1998 on the usefulness
of the LINPACK 1000 benchmark for predicting application performance in
scientific and engineering applications.
I gathered all of the publicly available application
and LINPACK benchmark data on then-current server systems from SGI,
IBM, Sun, Compaq, and HP. For each machine where I had application
performance data, I took the ratio of that machine's application performance
relative to a 195 MHz SGI Origin2000 as the "y" value. The corresponding
"x" value is the ratio of the LINPACK 1000 performance of the sample machine
relative to the 195 MHz SGI Origin2000.
As you can see, the scatter in the data suggests
that there is no correlation at all!!!!
For those who like numbers, the correlation coefficient
is 0.015, and the best-fit curve has a slope of about 0.1 and a y-intercept
of about 0.9. That means that the best-fit curve says that
doubling the LINPACK performance corresponds to a 10% application
performance increase. Not a very useful predictor....
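For reference, the "best-fit curve" here is presumably an ordinary
least-squares line through the (LINPACK ratio, application ratio) pairs.
A minimal C sketch of that computation follows; the ratio values in it are
hypothetical placeholders, since the 1998 data set is not reproduced here.

    #include <math.h>
    #include <stdio.h>

    /* Ordinary least-squares fit y = slope*x + intercept, plus Pearson correlation.
     * x[i] = LINPACK 1000 ratio vs. the 195 MHz Origin2000,
     * y[i] = application performance ratio vs. the same machine.
     */
    static void fit(const double *x, const double *y, int n,
                    double *slope, double *intercept, double *r)
    {
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i];  sy += y[i];
            sxx += x[i] * x[i];  syy += y[i] * y[i];  sxy += x[i] * y[i];
        }
        double cov  = sxy - sx * sy / n;
        double varx = sxx - sx * sx / n;
        double vary = syy - sy * sy / n;
        *slope = cov / varx;
        *intercept = (sy - *slope * sx) / n;
        *r = cov / sqrt(varx * vary);
    }

    int main(void)
    {
        /* Hypothetical ratios for illustration only -- not the 1998 data set. */
        double x[] = {0.8, 1.0, 1.3, 1.7, 2.1};
        double y[] = {0.9, 1.0, 0.95, 1.1, 1.05};
        double slope, intercept, r;
        fit(x, y, 5, &slope, &intercept, &r);
        printf("slope=%.2f intercept=%.2f r=%.2f\n", slope, intercept, r);
        return 0;
    }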
Although I have not included the chart here, the
STREAM benchmark is similarly useless for predicting application performance
for these scientific and technical applications.
Are there better single figures of merit?
I found two of approximately equal predictive value. The first
is the SPECfp95 performance. The correlation coefficient is about
0.39, which is not great. The better news is that the slope of the
best-fit curve is almost exactly 1.0 and the y-intercept is almost exactly
0.0. So the best-fit curve suggests that (in an RMS average sense)
doubling the SPECfp95 score corresponds to doubling the application performance.
The second single-figure-of-merit predictor is
an optimal combination of the LINPACK and STREAM performance values.
I don't recall the relative coefficients, but the correlation coefficient
was about 0.36 -- pretty close to that of the SPECfp95 result.
The next section of the talk is a more detailed focus
on one particular application area: scientific and engineering computing.
I choose this area for discussion simply because I have lots of data, not
because such studies are uninteresting in other areas. To put
numbers on the discussion, the market for server systems for scientific
and engineering computing is about $6B/year, compared to the $24B/year
for general-purpose UNIX servers. So it is not a large market,
but not a negligible one, either.
Performance-Related Characteristics of Scientific and Engineering Applications
John McCalpin, IBM (mccalpin@us.ibm.com)
Ed Rothberg, ILOG (rothberg@ilog.com)
(This work was performed while the authors were employed by SGI)
Motivations and Assumptions
We suspected that both academia and industry were too focused
on standard benchmarks, and that these benchmarks might be significantly
different from commercially important applications.
We needed a broad study so that revenue-based
weighting could be used to interpret the results.
We chose commercial and community applications
because of their economic importance and relative stability.
What This Is:
A broad overview of the performance characteristics of commercial
and community codes in science and engineering.
A comparison of these performance characteristics with those
of a number of industry standard benchmarks.
Based solely on performance counter measurements from uniprocessor
runs of each application.
To give an idea of the value of performance
counter measurements for understanding application performance, I took
the counter results for the application set (described later) and plugged
them into a very simple CPI-based model. The model was not tuned
for this data set -- I simply applied what I thought the correct costs
should be per event for the target machine (a 195 MHz SGI Origin2000 with
4 MB L2 cache). Most of the predicted performance results were
within +/- 20% of the observed values, which seems pretty good for a model
with no actual knowledge of the microarchitecture.
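For concreteness, a model of this flavor is sketched below: a base CPI for
the core plus a fixed cost per counted event, with event rates taken from the
performance counters. The per-event costs and counter totals shown are
illustrative assumptions, not the exact numbers used in the study.

    #include <stdio.h>

    /* Event totals taken from the hardware performance counters. */
    struct counters {
        double instructions;
        double l1d_misses;
        double l2_misses;
        double tlb_misses;
        double branch_mispredicts;
    };

    /* Predict CPI as a base CPI plus a fixed cost per counted event.
     * The costs below are illustrative guesses for a 195 MHz Origin2000-class
     * machine, not the values actually used for the study.
     */
    static double predict_cpi(const struct counters *c)
    {
        const double base_cpi      = 0.7;    /* assumed core CPI with no stalls */
        const double l1_miss_cost  = 10.0;   /* cycles, L1 miss that hits in L2 */
        const double l2_miss_cost  = 60.0;   /* cycles, miss to local memory    */
        const double tlb_miss_cost = 50.0;   /* cycles, TLB reload              */
        const double branch_cost   = 4.0;    /* cycles per mispredicted branch  */

        return base_cpi
            + l1_miss_cost  * (c->l1d_misses         / c->instructions)
            + l2_miss_cost  * (c->l2_misses          / c->instructions)
            + tlb_miss_cost * (c->tlb_misses         / c->instructions)
            + branch_cost   * (c->branch_mispredicts / c->instructions);
    }

    int main(void)
    {
        /* Hypothetical counter totals for one run. */
        struct counters c = {1.0e11, 2.0e9, 4.0e8, 5.0e7, 1.0e9};
        printf("predicted CPI = %.2f\n", predict_cpi(&c));
        printf("predicted MIPS at 195 MHz = %.0f\n", 195.0 / predict_cpi(&c));
        return 0;
    }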
What This Is NOT:
NOT a detailed study of the performance characteristics of
any individual application.
NOT based on any trace-driven or execution-driven simulations.
NOT providing any direct information about the performance
of any application.
Methodology
Applications chosen were "important" applications supported
by SGI.
The analyst in charge of each application chose a "relevant"
data set --- no toy problems!
Applications were run with full CPU performance counters
on an SGI Origin2000 system.
195 MHz CPU
4 MB L2 cache
1 GB local RAM
Application Coverage
Linear Finite Element Analysis (3 data sets, 2 applications)
Nonlinear Implicit Finite Element Analysis (8 data sets,
3 applications)
Nonlinear Explicit Finite Element Analysis (3 data sets,
3 applications)
Finite Element Modal (Eigenvalue) Analysis (6 data sets,
3 applications)
Computational Fluid Dynamics (13 data sets, 6 applications)
Computational Chemistry (7 data sets, 2 applications)
Weather/Climate Modelling (3 data sets, 2 applications)
Linear Programming (2 data sets, 2 applications)
Petroleum Reservoir Modelling (3 data sets, 2 applications)
Benchmark Coverage
NAS Parallel Benchmarks Version 1, Class B, pseudo-applications
(3 benchmarks)
NAS Parallel Benchmarks Version 1, Class B, kernels (5 benchmarks)
SPEC CFP95 (10 benchmarks)
SPEC CFP2000 (13 of the 14 floating-point benchmarks -- one is missing)
LINPACK NxN (heavily optimized, N>1000)
STREAM (whole-program statistics)
Primary Metrics Analyzed
Execution Rate (MIPS)
Bandwidth Utilization
Main Memory
L2 cache
L1 cache
Cache Miss Statistics: Icache, Dcache, and L2 cache
TLB misses
Mispredicted Branches
Instruction Mix (FP, branch, LD/ST, other)
I begin with my standard overview chart, showing
the spread of execution rates vs memory bandwidth utilization. The
dashed line is a trivial model that assumes a core performance of 285 MIPS
and a stall on each cache line miss with a cost equal to the time per cache
line transfer measured using the TRIAD kernel of the STREAM benchmark.
It is not surprising that this simple model fits the data fairly well,
since the compiler used for this study (in early 1997) did not generate
prefetch instructions, and the out-of-order execution capability of the machine
is not adequate to tolerate the ~60 cycles of main memory latency.
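The dashed-line model can be written down in a few lines: each instruction
costs 1/285 microseconds of core time, plus a full cache-line transfer time
for every line fetched from memory, with the transfer time implied by the
STREAM TRIAD bandwidth. The sketch below uses assumed values for the line
size and the TRIAD bandwidth, so the absolute numbers are illustrative only.

    #include <stdio.h>

    /* Dashed-line model: a fixed 285 MIPS core rate, plus a full stall for
     * every cache line fetched from memory, with the stall equal to the
     * per-line transfer time implied by the STREAM TRIAD bandwidth.
     * The line size and TRIAD bandwidth below are assumed values, not the
     * measured numbers behind the original chart.
     */
    static double model_mips(double lines_per_instruction)
    {
        const double core_mips  = 285.0;
        const double line_bytes = 128.0;            /* assumed L2 line size       */
        const double triad_bw   = 300.0e6;          /* assumed TRIAD bytes/second */
        const double t_line     = line_bytes / triad_bw;   /* seconds per line   */

        double t_instr = 1.0 / (core_mips * 1.0e6)
                       + lines_per_instruction * t_line;
        return 1.0e-6 / t_instr;   /* delivered MIPS */
    }

    int main(void)
    {
        for (int i = 0; i <= 4; i++) {
            double m = 0.005 * i;   /* cache lines fetched per instruction */
            printf("lines/instr = %.3f  ->  %.0f MIPS\n", m, model_mips(m));
        }
        return 0;
    }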
The most important thing to get from this picture
is that many of the application areas "clump" into regions of similar memory
bandwidth utilization, even using completely different codes from different
vendors, in some cases written in different decades!
The Computational Fluid Dynamics (CFD) codes
are an exception, with a long story that I don't have time to go into here....
To keep this presentation manageable, I will limit
the following discussion to comparisons of the applications (all grouped
together) with the SPEC CFP2000 benchmarks. These results
were taken from a pre-release version of the benchmarks, and so may not
precisely correspond to the final version -- in particular, one of the
14 final benchmarks is missing from this set -- but I do not expect that
the overall performance characteristics changed much between this version
and the final version.
I begin with a repeat of the previous slide, but
with the applications grouped together and the CFP2000 benchmarks added.
The most important thing that I see in this picture is that the SPEC CFP2000
benchmarks do not capture the large group of applications with low bandwidth
demand and high IPC. In fact, it is not hard to convince oneself
that the overall bandwidth demand of the CFP2000 benchmarks is somewhat
higher than that of the real applications.
Now I move to a review of the L2 cache bandwidth
utilization. Interestingly, the SPEC benchmarks use less
L2 bandwidth than the real applications. A code-by-code comparison shows
that in a number of cases, the L2 utilization for the SPEC CFP2000 benchmarks
is limited by the main memory bandwidth --- i.e., the data is streaming
through the L2 cache with minimal reuse.
Now I jump to the TLB miss rate. Unfortunately,
these results are bogus, since the SPEC CFP2000 results were run using
large pages and therefore have anomalously low TLB miss rates. I
have done this comparison with the CFP95 benchmarks using small pages (i.e.
the same as the applications used) and found that the difference is real,
but the ratio of TLB miss rates is closer to one order of magnitude
than the three orders of magnitude shown here.
To recap the last three slides, I showed that
the CFP2000 benchmarks:
use more memory bandwidth
use less L2 bandwidth
have significantly lower TLB miss rates
Based on these items and discussions with
the application analysts, I make the following conjecture:
The SPEC CFP2000 benchmarks are
not effectively blocked for the L2 cache, and they access memory in a more
regular (dominantly unit-stride) fashion than the real applications.
The next slide shows the Icache miss rates for the
benchmarks and applications: the SPEC benchmarks are mostly at least an
order of magnitude lower than the applications. This is important
because many of the applications have Icache miss rates high enough to
contribute roughly 10% of the code's CPI, and they cannot reasonably
tolerate worse Icache performance (either higher miss rates or larger
miss penalties). The SPEC codes
are in agreement with the incorrect "traditional wisdom" that scientific
and engineering codes have minimal Icache performance requirements.
The SPEC codes show very good agreement with the applications in the
range and mean values of mispredicted branch rates. It is not
clear to me, however, that they agree for the right reasons. Discussions
with the analysts have suggested that the real applications have significantly
more "IF" statements than the benchmarks, and it seems likely that many
of their mispredicted branches are due to these constructs. Reviewing
the SPEC CFP2000 codes suggests that most of the mispredicted branches
are associated with loop termination. This remains an area for further
investigation.
One of the most interesting results of the study
is in the dynamic instruction distribution. The following chart shows
the load/store operation fraction vs the floating-point operation fraction.
Note that the real applications are typically only 10%-30% FP instructions,
while the SPEC benchmarks range from 22%-45% FP. The important
microarchitectural feature is the ratio of loads and stores to FP operations
-- for the applications it varies from 1 to 4 with a mean of about 2, while
for the benchmarks it ranges from 0.75 to 2, with a mean near 1.
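The arithmetic behind that ratio is simply the load/store count divided by
the floating-point operation count. The tiny sketch below works one
hypothetical example with made-up counter totals.

    #include <stdio.h>

    /* Instruction-mix ratio discussed above, computed from counter totals.
     * The counts are hypothetical and chosen only to illustrate the arithmetic:
     * 20% FP and 40% loads+stores gives a LD/ST-to-FP ratio of 2.
     */
    int main(void)
    {
        double total = 1.0e11;     /* graduated instructions (hypothetical) */
        double fp    = 2.0e10;     /* floating-point operations             */
        double ldst  = 4.0e10;     /* loads + stores                        */

        printf("FP fraction      = %.2f\n", fp / total);
        printf("LD/ST fraction   = %.2f\n", ldst / total);
        printf("LD/ST per FP op  = %.2f\n", ldst / fp);
        return 0;
    }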
Conjecture:
Real applications have many loads and
stores associated with register spills due to complex inner loop bodies.
The SPEC CFP2000 codes have few spills because their inner loop bodies
are much simpler.
Finally, I review the role of the caches as "bandwidth
reducers". The "bandwidth reduction factor" is defined for each level
of the cache hierarchy as the bandwidth going "in" (toward or from the
core or lower levels of the cache hierarchy) divided by the bandwidth going
"out" (to/from memory or the next outer level of the cache hierarchy).
I define these separately for the L2 and L1 (data) caches, and note that
the total bandwidth reduction parameter is the product of these two values.
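Written out, the computation is just a pair of ratios; the sketch below uses
hypothetical bandwidth values to show how the L1 and L2 reduction factors and
their product are formed.

    #include <stdio.h>

    /* Bandwidth reduction factors as defined above: for each cache level, the
     * bandwidth on its inner (core-facing) side divided by the bandwidth on its
     * outer (memory-facing) side.  The bandwidth values are hypothetical.
     */
    int main(void)
    {
        double bw_core_l1 = 1200.0;   /* MB/s between registers and L1 dcache */
        double bw_l1_l2   =  400.0;   /* MB/s between L1 and L2               */
        double bw_l2_mem  =   40.0;   /* MB/s between L2 and main memory      */

        double brf_l1 = bw_core_l1 / bw_l1_l2;   /* L1 bandwidth reduction factor */
        double brf_l2 = bw_l1_l2   / bw_l2_mem;  /* L2 bandwidth reduction factor */

        printf("L1 reduction = %.1f\n", brf_l1);
        printf("L2 reduction = %.1f\n", brf_l2);
        printf("total        = %.1f\n", brf_l1 * brf_l2);   /* product of the two */
        return 0;
    }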
The L1 bandwidth reduction parameters for the
SPEC CFP2000 benchmarks cover pretty much the same range as those
of the real applications, with a few outliers.
The L2 bandwidth reduction parameters for the
SPEC CFP2000 benchmarks do not show the same pattern as the applications,
which have many results in the range of 10 to 50, a range in which there is
only one SPEC CFP2000 result. The SPEC CFP2000 results also have
two outliers, with L2 bandwidth reduction parameters of 1500 and 4000,
respectively, corresponding to fully cache-contained codes. There
were no fully cache-contained application test cases.
These results are consistent with the conjecture
that the SPEC CFP2000 codes are not effectively blocked for L2 cache,
while many of the applications are effectively blocked.
Summary
We surveyed 25 applications processing a total of 48 data
sets
Performance-related application characteristics differed
significantly from standard benchmarks in:
memory bandwidth used
dynamic instruction mix
Icache miss rate
TLB miss rate
Conclusions
Current CPU-intensive benchmarks have some similarities
with economically important applications, but also many important
differences.
These benchmarks are not useful for engineering
trade-offs until the degree of similarity that they show with applications
is quantified.
In order to make this quantitative evaluation, we need
to gather and share a lot more data from real applications.
The SPEC CPU committee is investigating conducting
such a shared survey in support of the development of the SPEC CPU
200x benchmark suite.
Last revised 2000-10-04
John McCalpin
john@mccalpin.com