
The following benchmarks show results as of May 1994, six months after the release of the CRAY T3D product. The results indicate that in this short span of time, the CRAY T3D system substantially outperformed other MPPs.

As shown in Figure 4, a CRAY T3D system with 256 processors delivered the fastest execution of all eight NAS Parallel Benchmarks on any MPP of any size. (The NAS Parallel Benchmarks are eight codes specified by the Numerical Aerodynamic Simulation [NAS] program at NASA/Ames Research Center. NAS chose these codes to represent common types of fluid dynamics calculations.) The CRAY T3D system scaled these benchmarks more efficiently than all other MPPs, with near linear scaling from 32 to 64, 128, and 256 processors.


Other MPPs scaled the benchmarks poorly. None of these other MPPs reported all eight benchmarks scaling to 256 processors, and the scaling that was reported was less linear than on the CRAY T3D system. These benchmark results confirm that the superior speed of the CRAY T3D interconnection network is important when scaling a wide range of algorithms to run on hundreds of processors.

Note that a 256-processor CRAY T3D system was the fastest MPP running the NAS Parallel Benchmarks. Even so, the CRAY C916 parallel vector processor ran six of the eight benchmarks faster than the CRAY T3D system. The CRAY T3D system (selling for about $9 million) showed better price/performance than the CRAY C916 system (selling for about $27 million). On the other hand, the CRAY C916 system showed better absolute performance. When we run these codes on a 512-processor CRAY T3D system (later this year), we expect the CRAY T3D to outperform the CRAY C916 system on six of the eight codes.

Figure 4  NAS Parallel Benchmarks (bar chart comparing the CRAY C916, a CRAY T3D with 256 PEs, and the fastest other MPPs reported on 64 to 512 PEs, across the kernels EP, FT, MG, and IS and the applications CG, BT, SP, and LU)

Key: EP = embarrassingly parallel (typical of Monte Carlo applications); FT = 3-D fast Fourier transform partial differential equation (typical of "spectral" codes); MG = multigrid (simplified multigrid kernel solving a 3-D Poisson partial differential equation); IS = integer sort; CG = conjugate gradient (typical of a large, sparse matrix); BT = block tridiagonal (typical of ARC3D); SP = scalar pentadiagonal (typical of ARC3D); LU = lower-upper diagonal (typical of newer implicit computational fluid dynamics)
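To make the scaling and price/performance comparisons above concrete, the short Python sketch below computes a parallel efficiency relative to ideal linear scaling and a simple cost-times-time figure of merit. Only the approximate list prices ($9 million and $27 million) come from the text; the timings are made-up placeholders, not the measured NAS results.

    def parallel_efficiency(t_base, p_base, t_scaled, p_scaled):
        """Efficiency of scaling from p_base to p_scaled processors; 1.0 is perfectly linear."""
        speedup = t_base / t_scaled
        ideal = p_scaled / p_base
        return speedup / ideal

    def price_performance(time_seconds, price_dollars):
        """Dollars times seconds for one benchmark run; lower is better."""
        return price_dollars * time_seconds

    # Illustrative timings only (seconds) for a single benchmark.
    t32, t256 = 40.0, 5.5
    print(parallel_efficiency(t32, 32, t256, 256))   # about 0.91, i.e., near linear scaling

    t3d_time, c916_time = 5.5, 3.0                   # the C916 is faster in absolute terms
    print(price_performance(t3d_time, 9e6))          # CRAY T3D at roughly $9 million
    print(price_performance(c916_time, 27e6))        # CRAY C916 at roughly $27 million: a larger (worse) value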


Heterogeneous benchmark results are also encouraging. We benchmarked a chemistry application, SUPERMOLECULE, that simulates an imidazole molecule, on a CRAY T3D system with a CRAY Y-MP host. The application was 98 percent parallel, with 2 percent of the overall time spent in serial code (to diagonalize a matrix). We made a baseline measurement by running the program on 64 CRAY T3D processors. Quadrupling the number of processors (256 PEs) showed poor scaling: a speedup of 1.3 times over the baseline measurement. When we moved the serial code to a CRAY Y-MP processor on the host, leaving the parallel code on 256 CRAY T3D processors, the code ran 3.5 times faster than the baseline, showing substantially more efficient scaling. Figure 5 shows SUPERMOLECULE benchmark performance results on both homogeneous and heterogeneous systems. Ninety-eight percent may sound like a high level of parallelism, but after dividing 98 percent among 256 processors, each processor ran less than 0.4 percent of the overall parallel time. The remaining serial code running on a single PE ran five times longer than the distributed parallel work, thus dominating the time to solution. Speeding up the serial code by running it on a faster vector processor brought the serial time in line with the distributed-parallel time, improving the scaling considerably.
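The arithmetic behind these numbers is essentially Amdahl's law, sketched below in Python. The 98 percent/2 percent split and the 64- and 256-PE configurations come from the text; the fivefold serial speedup assumed for the CRAY Y-MP host is only an illustrative value, chosen to bring the serial time roughly in line with the distributed-parallel time.

    PARALLEL_FRACTION = 0.98   # fraction of single-PE time that is distributed across PEs
    SERIAL_FRACTION = 0.02     # matrix diagonalization, runs on one processor

    def run_time(pes, serial_speedup=1.0):
        """Idealized time, relative to one PE, with the serial part either on a
        single CRAY T3D PE (serial_speedup = 1) or sped up on a vector host."""
        return PARALLEL_FRACTION / pes + SERIAL_FRACTION / serial_speedup

    baseline = run_time(64)                            # 64 T3D PEs, serial code on one PE
    homogeneous = run_time(256)                        # 256 T3D PEs, serial code still on one PE
    heterogeneous = run_time(256, serial_speedup=5.0)  # serial code moved to the Y-MP host (assumed 5x)

    print(baseline / homogeneous)                      # about 1.5, the model's ceiling for 256 PEs
    print(baseline / heterogeneous)                    # about 4.5 in this idealized model
    print(SERIAL_FRACTION / (PARALLEL_FRACTION / 256)) # about 5.2: serial work dominates on 256 PEs

The measured speedups reported above fall somewhat below these idealized ceilings because the model ignores communication and other overheads, but the structure of the result is the same: once the parallel work is spread across 256 PEs, the serial 2 percent dominates unless it is moved to a faster processor.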

Figure 5  SUPERMOLECULE Benchmark Performance Results for Homogeneous and Heterogeneous Systems

The CRAY T3D system demonstrated faster I/O throughput than any other MPP. A 256-processor system sustained over 570 megabytes per second of I/O to a disk file system residing on a solid-state device on the host. The system sustained over 360 megabytes per second to physical disks.

Summary

This paper describes the design of the CRAY T3D system. Designers incorporated application profiles and customer suggestions into the CRAFT programming model. The model permits high-performance exploitation of important computational algorithms on a massively parallel processing system. Cray Research designed the hardware based on the fundamentals of the programming model.

As of this writing, a dozen systems have shipped to customers, with results that show the system design is delivering excellent performance. The CRAY T3D system is scaling a wider range of codes to a larger number of processors and running benchmarks faster than other MPPs. The sustained I/O rates are also faster than on other MPPs. The system is performing as designed.

References

Proceedings of COMPCON, 1993: 176-182.

3. S. Scott and G. Thorson, "Optimized Routing in the CRAY T3D," extended abstract for the Parallel Computing Routing and Communication Workshop (1994).

4. D. Pase, T. MacDonald, and A. Meltzer, "The CRAFT Fortran Programming Model," CRAY Internal Report (Eagan, MN: Cray Research, Inc., February 1993) and Scientific Programming (New York: John Wiley and Sons, forthcoming).

6. G. Fox et al., Fortran D Language Specification (Houston, TX: Department of Computer Science, Rice University, Technical Report TR90-141, December 1990).

7. B. Chapman, P. Mehrotra, and H. Zima, Vienna Fortran - A Fortran Language Extension for Distributed Memory Multiprocessors (Hampton, VA: ICASE, NASA Langley Research Center, 1991).

8. High Performance Fortran (High Performance Fortran Language Specification, Version 1.0) (May 1993). Also available as technical report CRPC-TR 92225 (Houston, TX: Center for Research on Parallel Computation, Rice University) and in Scientific Computing (forthcoming).

9. P. Mehrotra, "Programming Parallel Architectures: The BLAZE Family of Languages," Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing (December 1988): 289-299.


10. R. Millstein, "Control Structures in ILLIAC IV Fortran," Communications of the ACM, vol. 16, no. 10 (October 1973): 621-627.

11. J. Kohn and W. Williams, "ATExpert," The Journal of Parallel and Distributed Computing, 18 (June 1993): 205-222.

12. R. Sites, ed., Alpha Architecture Reference Manual (Burlington, MA: Digital Press, Order No. EY-L520E-DP, 1992).

13. DECchip 21064-AA Microprocessor Hardware Reference Manual, 1st ed. (Maynard, MA: Digital Equipment Corporation, Order No. EC-N0079-72, October 1992).

14. D. Bailey and R. Schreiber, "Problems with RISC Microprocessors for Scientific Computing" (Moffett Field, CA: NASA/Ames, RNR Technical Report, September 1992).

15. D. Bailey, E. Barszcz, L. Dagum, and H. Simon, "NAS Parallel Benchmark Results" (Moffett Field, CA: NASA/Ames, RNR Technical Report, March 1994 [updated May 1994]).
