
towards fault-tolerant computing or transaction processing, the market share for supercomputers remains at about 3%. Some supercomputer designs are partially based on standard workstations, but enhanced with special-purpose interconnection networks; the IBM SP-2 is a good example of a machine of this type.

All these factors contribute to making clusters a very viable alternative to custom-designed supercomputers. Consequently, there is already large and growing interest in industry, not only with regard to parallel systems. As an example of this trend, consider Microsoft’s Windows/NT cluster system, Wolfpack [257], or the Virtual Interface Architecture (VIA) proposal [291], jointly promoted by Intel, Microsoft, and Compaq.

VIA describes an architecture for the interface between computer systems and high-performance networks which aims at reducing application-level latency.

1.2 Problems with clusters

Given all these advantages of clusters, such as superior price/performance and time to market, why are supercomputers still manufactured and sold? Apparently, there are still some areas where clusters do not constitute an acceptable solution. This section gives an overview of such issues and identifies areas that require additional research efforts.

1.2.1 Communication

The most evident problem of clusters, compared to supercomputers, is the efficiency of distributed computations. Since the CPU performance available in COTS systems is comparable to and, owing to the long time-to-market of custom designs, sometimes even superior to that of custom-built supercomputers (as indicated by Table 1.1), the communication performance, characterized by bandwidth, latency, and overhead, is the determining factor for parallel performance. This in turn depends mostly on the communication hardware and the integration of communication into the end system.
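To a first approximation, these three factors combine additively. The time to deliver an n-byte message can be estimated with the usual back-of-the-envelope cost model (a generic illustration, not the analysis model used later in this dissertation):

\[ T(n) = o_{send} + L + \frac{n}{B} + o_{recv} \]

where $o_{send}$ and $o_{recv}$ are the software overheads on the sending and receiving host, $L$ is the network latency, and $B$ the bandwidth. For small messages, overhead and latency dominate $T(n)$; bandwidth becomes the limiting factor only for large messages, which is why reducing per-message overhead matters so much for fine-grained parallel programs.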

A number of challenges make high communication performance more difficult to achieve in a COTS cluster than in a supercomputer. The most important ones are: physical distance between nodes, integration of the network interface in a node’s hardware/software architecture, and the need for a higher level of protection of resources.

The small physical distances between nodes in a supercomputer allow the use of faster and more reliable communication hardware than in a cluster. The lower reliability of Local Area Networks (LAN) has forced clusters to use heavy-weight protocol stacks like Transmission Control Protocol (TCP)/Internet Protocol (IP), incurring a high performance penalty. This shortcoming is rapidly being remedied with the advent of what have been called System Area Networks (SAN) [109]: Myricom’s Myrinet [39] or Compaq’s ServerNet [252] are examples of networks that deliver Gigabits per second (Gbps) of bandwidth and latencies on the order of microseconds, with very high reliability.

The second problem is the integration of the network interface into the host architecture. Typically, network interfaces are connected to the I/O system of a COTS machine, whereas in a supercomputer, the network interface can be connected directly to the memory bus or to the processor itself. This placement at the I/O system incurs performance penalties, but has been addressed by much research (an overview can be found, e.g., in [205]).

The question of virtualizing the network interface and protecting it from conflicting accesses by several processes constitutes the third problem. Since a supercomputer is often used by only one application at a time, this application can be granted uncontrolled access to a system resource like the network interface. In a COTS machine, on the other hand, the network interface has to be designed to protect the multiple applications that share a single machine from each other; e.g., an application must not be allowed to receive messages addressed to another application.
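The following sketch illustrates this protection problem in C (a hypothetical, simplified interface, loosely inspired by user-level schemes such as VIA; all names and data structures are invented for illustration). The network interface demultiplexes incoming messages into per-process endpoints and refuses to let a process read from an endpoint it does not own:

\begin{verbatim}
/* Hypothetical sketch of per-process protection on a user-level
 * network interface (NI). The NI demultiplexes incoming messages
 * into per-process endpoints; a process may only read its own. */
#include <stdint.h>
#include <string.h>

#define MAX_ENDPOINTS 16
#define QUEUE_SLOTS    8
#define MSG_SIZE      64

typedef struct {
    uint32_t owner_pid;               /* process that opened the endpoint */
    uint8_t  queue[QUEUE_SLOTS][MSG_SIZE];
    int      head, tail;              /* circular receive queue */
} endpoint_t;

static endpoint_t endpoints[MAX_ENDPOINTS];

/* A process claims a free endpoint; owner_pid 0 marks an unused slot. */
int ep_open(uint32_t pid)
{
    for (int i = 0; i < MAX_ENDPOINTS; i++)
        if (endpoints[i].owner_pid == 0) {
            endpoints[i].owner_pid = pid;
            return i;
        }
    return -1;
}

/* Called by the (simulated) NI when a message arrives for endpoint ep. */
int ni_deliver(int ep, const void *msg, size_t len)
{
    if (ep < 0 || ep >= MAX_ENDPOINTS || len > MSG_SIZE)
        return -1;                    /* malformed: drop */
    endpoint_t *e = &endpoints[ep];
    int next = (e->tail + 1) % QUEUE_SLOTS;
    if (next == e->head)
        return -1;                    /* queue full: drop */
    memcpy(e->queue[e->tail], msg, len);
    e->tail = next;
    return 0;
}

/* Called by a process; the owner check keeps applications apart. */
int ep_receive(int ep, uint32_t caller_pid, void *buf)
{
    if (ep < 0 || ep >= MAX_ENDPOINTS)
        return -1;
    endpoint_t *e = &endpoints[ep];
    if (e->owner_pid != caller_pid)
        return -1;                    /* protection violation */
    if (e->head == e->tail)
        return 0;                     /* nothing pending */
    memcpy(buf, e->queue[e->head], MSG_SIZE);
    e->head = (e->head + 1) % QUEUE_SLOTS;
    return 1;
}
\end{verbatim}

In a real user-level network interface, the explicit owner check would typically be replaced by memory protection: each endpoint’s queues are mapped only into the address space of the owning process, so the hardware enforces the separation without any system call.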

Closely related to the question of communication performance is the question of synchronization. Synchronization is, in a certain sense, a prerequisite for communication, and some programming models make this very explicit. Additionally, closely synchronized execution of distributed parts of the program can have a large impact on performance. This is discussed in more detail in Chapter 8.

While the communication performance of clusters is, owing to these problems, not yet quite as high as that of supercomputers, much progress has been made (a more detailed discussion can be found in Section 3.2). And with communication performance, the performance delivered to a parallel application also increases. Pure performance is therefore not the issue of this dissertation.

1.2.2 Programming models

Writing a parallel program to execute in a cluster environment is a complicated endeavor compared to a supercomputer system. The machines in a cluster may well be heterogeneous, or at least of varied speed. Failures of machines are more likely in a cluster than in a closely administered machine, in particular if the machines in the cluster are shared with interactive users. The number of available machines in a cluster may well vary between different invocations of the same program. And although high-performance communication interfaces are becoming available for clusters, they are usually not nearly as well integrated into a cluster’s operating system as are their counterparts in parallel supercomputers.

Other issues have more to do with programmability and appear in both supercomputers and clusters: e.g., distributing complex data structures over connected machines. Such questions often have comparatively simple solutions in supercomputers, since their tighter integration of computation and communication allows more convenient programming models such as Distributed Shared Memory (DSM).

This observation is key to many approaches: programming models with a higher level of abstraction hide irrelevant details and allow the programmer to concentrate on application-specific problems. It is therefore promising to hide cluster-specific complexities behind a simple programming model as well. The systems of the Metacomputing in large asynchronous networks (Milan) project [23, 27, 64] follow this approach and hide complexities such as the number, varying speeds, and faults of machines by separating the semantics of a program from environment-specific issues. Calypso, one of these systems, is described in more detail in Chapter 4.

Additionally, such abstract programming models lend themselves naturally to extending their semantics to include new properties. It is conceptually easy to add yet another hidden complexity to such a model; nonetheless, the programmer and/or user has to provide sufficient information to make this possible. A mechanism for a programmer to express such additional information about a program is introduced in Chapter 9.

For users of high-performance systems, the abstraction level offered by such programming models is often still too low. A number of projects therefore target tools, libraries, and runtime environments that simplify the adaptation of numerical problems as well as the interaction and integration of existing applications. Tradeoffs between performance and usability, however, are still an open question. A recent description of some such projects can be found in [244].

1.2.3 Intrusiveness

Intimately tied to the idea of COTS systems is the notion of non-intrusiveness: not only should readily available components be used in system construction; they should moreover be used as is, without any unnecessary modifications. This idea is in sharp contrast with the design of supercomputers. While supercomputers increasingly use standard components like microprocessors, these components are often modified or endowed with additional, non-standard, custom hardware (like interconnection networks, buses, cache controllers, or even such low-level components as the Translation Look-aside Buffer (TLB)) or software (in particular, modified operating systems).

For a truly COTS-based system, such intrusions are unacceptable. Any add-ons or modifications must always preserve the correct function of all services the system offered before and must coexist with these standard services without interference: programs should still run, machines should perform their functions as before, and interfaces must not be changed. Also, no knowledge about internal mechanisms should be exploited, even if it is available.

Such non-intrusiveness has implications for the design of additional functionalities. In particular, middleware approaches that are layered on top of existing services without blocking access to lower layers are good candidates. In such an approach, an existing system is enhanced with additional software (and, if necessary, hardware) that provides the necessary functionality on top of the original system interfaces, adding new functionality without modifying them: nothing that need not be modified should be modified. Any add-ons must be strictly transparent.

Similarly, the only acceptable interfaces for a middleware solution are those that are provided by the system in a standard manner. A middleware that adds new properties should adhere to all possible conventions of program interoperability. While this limits the space of potential solutions, it is a sine qua non of any COTS approach.
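On Unix-like systems, library interposition is one well-known way to realize such strictly transparent add-ons without touching existing interfaces. The following sketch is illustrative only (the accounting functionality and the library name are invented, and this is not a mechanism proposed in this dissertation); it wraps the standard write() call in a shim library:

\begin{verbatim}
/* Non-intrusive add-on via library interposition: a middleware shim
 * wraps write() to add per-process I/O accounting while leaving the
 * interface and semantics of the original call intact.
 *
 * build: cc -shared -fPIC -o libacct.so acct.c -ldl
 * use:   LD_PRELOAD=./libacct.so some_unmodified_program           */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <unistd.h>

static ssize_t (*real_write)(int, const void *, size_t);
static size_t bytes_written;          /* accounting state added here */

ssize_t write(int fd, const void *buf, size_t count)
{
    if (!real_write)                  /* resolve the real symbol once */
        real_write = (ssize_t (*)(int, const void *, size_t))
                     dlsym(RTLD_NEXT, "write");
    ssize_t n = real_write(fd, buf, count);
    if (n > 0)
        bytes_written += (size_t)n;   /* added functionality */
    return n;                         /* original semantics preserved */
}
\end{verbatim}

Loaded via LD_PRELOAD, such a shim adds functionality without modifying the application, the C library, or the operating system, which is precisely the kind of non-intrusiveness demanded here.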

1.2.4 Management

A potential shortcoming of clusters is the lack of central information about the state of the cluster as a whole. In a supercomputer, there is typically some centralized entity that provides a single representation of the entire system. This simplifies administration, sharing of resources among multiple jobs (e.g., in a space-sharing fashion), fault masking (e.g., not allocating jobs to a failed processor), timely coordination of resource usage (e.g., coscheduling [219]), and other system and resource management issues.

While it is possible to provide such a single image of the state of a cluster, it is an expensive undertaking in terms of runtime overhead and might nonetheless yield information of only limited precision. It is therefore legitimate to ask how to decentralize these problems and how to solve them in a less tightly coupled environment such as a cluster. In [188], albeit in a slightly different context, three possible approaches to this question are discussed. The “omniscient” approach corresponds to the centralized information found in a supercomputer; obvious problems with it include scalability and fault tolerance. An alternative is “tamed nondeterminism”, implemented via consensus protocols: knowledge is exchanged periodically, and consensus is reached on future actions. Third, completely independent systems pursue their own objectives in an autonomous fashion.
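To make the second approach concrete, consider the following toy sketch in C (invented for illustration; it is not a protocol from [188]). Assume that a consensus round has already established an identical view of per-node load on every node; each node then applies the same deterministic decision rule to that view and arrives at the same action without any further message exchange:

\begin{verbatim}
/* Toy illustration of "tamed nondeterminism": given an agreed-upon
 * view of all node loads (obtained, e.g., by a consensus round),
 * every node computes the same placement decision deterministically. */
#include <stdio.h>

#define NODES 4

/* The agreed view: one load value per node, assumed identical on
 * all nodes after the exchange/consensus phase. */
static const double agreed_load[NODES] = { 0.7, 0.2, 0.9, 0.4 };

/* Deterministic decision rule: place the next job on the least
 * loaded node; ties broken by lowest node id. Since every node runs
 * this on identical input, all nodes reach the same conclusion. */
static int place_next_job(const double load[NODES])
{
    int best = 0;
    for (int i = 1; i < NODES; i++)
        if (load[i] < load[best])
            best = i;
    return best;
}

int main(void)
{
    printf("all nodes agree: run next job on node %d\n",
           place_next_job(agreed_load));
    return 0;
}
\end{verbatim}

The nondeterminism is “tamed” because all remaining decisions are deterministic functions of the agreed-upon state; the price that remains is the periodic consensus round itself.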

These questions become particularly interesting when combined with the demand for non-intrusive solutions. Also, management is never an end in itself but only a means to other objectives. As a concrete case of the issues arising in system management, Chapter 8 discusses managing resources in a cluster-based system so as to guarantee access to resources for both sequential and parallel programs.

1.2.5 Predictability and timeliness

In a typical supercomputer environment, users have yet another requirement: they want to depend on their programs being completed at a certain time. Historically, this was more an obligation imposed on users, because maximum runtimes were, and often still are, used to plan the order of program execution so as to maximize the utilization of a supercomputer. Over time, this has developed into an expectation, and people are often willing to bear the inherent burdens (like specifying the maximal resource requirements of a program when submitting it) to be able to rely on such predictable completions.

Such an ability to complete programs in time is crucial in a number of applications. Examples include real-time signal processing (e.g., processing radar signals [193]), weather-related services (Lee et al. [166] describe a scenario in which an IBM SP-2 is used as part of a wide-area system to process satellite images for cloud detection in nearly real time), the “almost real time” visualization of microtomography experiments [296], or even large-scale battlefield simulations (where interactivity makes timely completion of programs an indispensable condition). Therefore, executing programs in a timely manner is a capability that clusters should also be able to provide.

Meeting this requirement of predictable and timely execution of programs is not a simple task in a cluster. A number of factors contribute to this difficulty. One is the fact that clusters are often used in a time-shared fashion. This sharing can happen among multiple parallel programs or between parallel programs and interactive users. In either case, there is contention for resources, possibly limiting predictability and timeliness if this contention is in itself unpredictable. This contention raises the need for resource management functionality to deal with it.

A second factor is related to this time-shared usage: clusters are commonly less well guarded than supercomputers; it is, e.g., readily possible that someone reboots a machine within a cluster. Such a reboot has the same consequences as a crash fault of a machine, and faults in general are always a possibility that must be dealt with. The existence of faults also implies that, while predictability can be a useful tool for achieving timeliness, it is not a sufficient property: a program that always crashes before producing any results is perfectly predictable (and might even crash on time), but useless. Consequently, timeliness must be accompanied by dependability and corresponding fault-tolerance mechanisms to be useful.

The third factor is that, even given information about the program, and even in the absence of faults, the particular execution regime of a parallel programming system can introduce some uncertainty about the runtime of a program (e.g., owing to random effects like caching during program execution). This uncertainty is aggravated by faults and requires an analysis of program runtimes in an appropriate model. Similarly, the technical infrastructure of a typical cluster may not be as suitable for the timely execution of parallel programs as that of a supercomputer, potentially owing to rather low-level properties: the inherently probabilistic Ethernet is less predictable than a deterministic interconnection network.
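A back-of-the-envelope calculation illustrates how faults alone aggravate this uncertainty (a simple textbook-style estimate, not the runtime model developed later in this dissertation). Suppose a program has fault-free runtime $T$, each execution attempt fails independently with probability $p$ and forces a complete restart, and a failed attempt is assumed to consume time $T$ as well. The number of attempts is then geometrically distributed, so that

\[ E[T_{total}] = \frac{T}{1-p}, \qquad Var[T_{total}] = \frac{p\,T^2}{(1-p)^2}. \]

Even a modest failure probability of $p = 0.1$ inflates the expected runtime by roughly 11% and, more importantly for timeliness, introduces variance where a fault-free execution had none.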

These factors show that, while timely program execution is necessary for a growing number of applications, there are still many open questions to be solved before a cluster of workstations is a suitable environment for such applications. This dissertation attempts to contribute a few solutions to some aspects of this problem.