
3.6 Focus on Quality of Service

3.6.3 Endsystem Quality of Service

Even a hypothetical interconnection network that gives perfect Quality-of-Service guarantees between two hosts is not sufficient to ensure that messages between distributed parts of an application are delivered on time. Within a time-shared endsystem, an application process usually has to compete for the resources that are necessary to process, send, and receive messages. Examples of these resources are the Central Processing Unit (CPU), cache, the memory and I/O buses, bridges between these buses, DMA engines, and the network interface itself. Both the total amount of these resources and their temporal availability are crucial, as is the enforcement of contracts or subscriptions by the operating system. It is therefore necessary to build systems that perform admission control to avoid over-subscription and that schedule the correct resources for the right message at the right time. To this end, a number of Quality-of-Service architectures have been proposed; overviews of these architectures can be found in [16, 208]. Some notable systems and proposals include the following.
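To make the notion of admission control concrete: a minimal sketch of a utilization-based admission test, in the style of the classic rate-monotonic schedulability bound, might look as follows. The struct and function names are hypothetical and not taken from any of the systems discussed below.

#include <math.h>
#include <stddef.h>

/* One periodic message-processing activity: worst-case CPU demand
   per period and the period itself, in the same time unit. */
struct task {
    double cost;
    double period;
};

/* Hypothetical utilization-based admission test: accept the candidate
   only if the total CPU utilization of all n+1 tasks stays below the
   classic rate-monotonic schedulability bound n(2^(1/n) - 1). */
int admit(const struct task *adm, size_t n, const struct task *cand)
{
    double u = cand->cost / cand->period;
    for (size_t i = 0; i < n; i++)
        u += adm[i].cost / adm[i].period;
    double bound = (double)(n + 1)
                   * (pow(2.0, 1.0 / (double)(n + 1)) - 1.0);
    return u <= bound;   /* 0 means: reject to avoid over-subscription */
}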

NAHRSTEDT and SMITH [210] clearly show the need for Quality-of-Service support in the end systems. In an experimental investigation of end-to-end Quality of Service, they showed that operating-system effects dominated any effects caused by the network. They also observed that process priorities are of limited usefulness and must be coordinated among distributed applications and integrated into the network protocol stack processing.

Based on this work, the QoS Broker introduced by NAHRSTEDT and SMITH [211] focuses on multimedia applications (a telerobotics application was used as a test case) and coordinates the activities and resource requirements of multiple protocol processing layers. The broker negotiates with network management and remote brokers and selects runtime policies to implement an application's desired Quality of Service. For example, to bound jitter, the broker can decide whether to use more buffer space, tighten the jitter requirements on the network, or closely coordinate the execution of processes in time. The broker is therefore responsible for call admission as well as Quality-of-Service negotiation and translation.


This broker was later complemented by the OMEGA endpoint architecture for the provisioning of Quality-of-Service guarantees [209]. OMEGA essentially separates an application-level protocol from a network-level protocol stack as schedulable entities. From this architecture, some unexpected lessons were learned, e.g., the difficulties induced by a blocking DMA engine for transfers to the network interface, and the insight that the "real-time" priorities of the underlying AIX UNIX are not sufficient for Quality-of-Service protocol processing.

GOPALAKRISHNAN and PARULKAR [90] propose a Quality-of-Service framework for multimedia applications. It focuses on three main points: Quality-of-Service specification at the application level, where it identifies an isochronous, a bulk-data, and a low-delay class; Quality-of-Service mapping from application-level to network-level Quality-of-Service parameters; and Quality-of-Service enforcement. A traditional mechanism for Quality-of-Service enforcement would be to assign threads to independently schedulable protocol operations and to schedule them via real-time scheduling methods like EDF or RMS. However, doing so is inefficient, since processing a single data unit can be less time-consuming than the context switch to start a corresponding thread [56]. Instead of threads, this framework uses so-called real-time signals. These signals are scheduled using rate-monotonic scheduling with delayed preemption [91], i.e., a handler invoked by a signal is preempted only at the end of an iteration, which processes a single data unit. An implementation mechanism for this framework is also discussed.
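As a rough illustration of this idea, assuming POSIX real-time signals as the delivery mechanism (the handler body and names are hypothetical): each signal delivery triggers exactly one iteration, and filling sa_mask defers all other signal handlers until the iteration has completed, approximating the delayed-preemption behavior described above.

#include <signal.h>

static volatile sig_atomic_t units_done;

/* One signal delivery corresponds to one iteration, i.e., the
   processing of a single data unit. */
static void process_one_unit(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)info; (void)ctx;
    /* per-unit protocol processing would go here */
    units_done++;
}

int install_handler(void)
{
    struct sigaction sa = {0};
    sa.sa_sigaction = process_one_unit;
    sa.sa_flags = SA_SIGINFO;
    sigfillset(&sa.sa_mask);  /* defer other handlers: preemption only
                                 at the end of an iteration */
    /* SIGRTMIN..SIGRTMAX are queued and delivered in priority order,
       so a rate-monotonic priority can be mapped to a signal number */
    return sigaction(SIGRTMIN, &sa, NULL);
}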

MEHRA et al. [197] introduce a Quality-of-Service-sensitive communication subsystem architecture, which ensures (1) maintenance of Quality-of-Service guarantees, (2) overload protection, and (3) fairness to best-effort traffic. Real-time channels [126] are assumed as the underlying, Quality-of-Service-capable network. This subsystem is based on an x-kernel that has complete control of the CPU and can be implemented as a separate server running with suitable capacity reserves [169]. Such a server implementation, however, is somewhat in conflict with the need for user-level protocol processing in high-performance protocols (see Section 3.2). Quality-of-Service contracts are enforced and messages are processed using a process-per-channel paradigm.7 The host resources CPU bandwidth, link bandwidth, and buffer space are managed by such channel handler processes. For this architecture, MEHRA et al. [196] discuss tradeoffs between resource capacity and channel admissibility for real-time traffic, where the preemption grains of the CPU and the links are significant parameters.

A lot of research has been performed, and much progress has been made. Two main themes can be observed: approaches either cope with a given operating-system environment and manipulate it as far as possible (e.g., by modifying priorities as in [209]), or they modify the operating system extensively. Most approaches of the latter kind share a common technique: arriving packets are demultiplexed early and handled by separately schedulable entities, ideally supported by programmable network interfaces.

However, it is still too early to judge which solutions are best for a given application. There is still a long way to go toward a commonly applicable, generally available host architecture that is capable of delivering an underlying network's Quality-of-Service guarantees to an application.

7 As opposed to the process-per-protocol model sometimes used in UNIX, or to an also conceivable process-per-message model.

A theory has only the alternative of being right or wrong. A model has a third possibility: it may be right, but irrelevant.

– Manfred Eigen

Chapter 4

Problems in Responsive Cluster Computing—The Calypso Case

In Chapter 3, two parts of the Milan project, Calypso and Charlotte, were identified as good candidates for building a responsive system for high-performance computing based on COTS components. The basic techniques and the implementation of these systems are described here in greater detail (see Section 4.1), concentrating on the Calypso system. A number of factors limiting Calypso's responsiveness are identified in Section 4.2, which are then remedied in the following chapters. The discussion of Charlotte is postponed to Chapter 9, where issues of metacomputing in wide area networks are examined. A sample Calypso program and some experiments are discussed in Section 4.3 and Section 4.4, respectively.

4.1 An Overview of Calypso

Calypso [23] is a software system for parallel programming in cluster environments. One of the main objectives of Calypso is to provide a simple programming environment: a programmer should not have to worry about the complexities of an actual execution environment, like the number of available machines, the relative speed of machines, or potentially failing machines. It is much easier to write programs for an abstract, simplified, perfect environment. Calypso provides such an abstraction: a Calypso program is written for an idealized, perfectly reliable PRAM with an infinite number of processors. It is the responsibility of the Calypso middleware to implement this abstraction on a set of real machines, hiding the aforementioned complexities.

The theoretical background for the Milan project (of which Calypso is but one system) is work on the efficient asynchronous execution of large-grained parallel programs (cf., e.g., [15]). Consider a program P that is written for an idealized, synchronous PRAM machine, using a BSP-like style [287]. P consists of a number of parallel steps, where each step has a number of parallel routines. Assume further that in one parallel step, any shared variable is updated by at most one routine. It is a challenging problem to execute such a program on a realistic, asynchronous machine (where a processor can also become infinitely slow, i.e., fail). This problem is solved in [15] by compiling P into a semantically equivalent program C(P) that can be efficiently executed on an asynchronous machine.

To write programs for such an idealized PRAM, Calypso extends the C programming language with only four keywords: shared, parbegin, parend, and routine. shared is used to declare data as accessible from concurrently executing parts of the code. The keywords parbegin and parend together demarcate a parallel step: code inside such a parallel step can be executed in parallel, code outside is run sequentially.

Hence, a Calypso program is an alternation of sequential and parallel steps. Within a parallel step, routine denotes one or more units of concurrent execution. A parend constitutes a barrier synchronization for all routines within a parallel step: the step ends once all its routines have terminated. This concept of barrier synchronization for parallel routines captures the essential point of VALIANT's BSP programming model [287]. An example of such a parallel step would look like this:

parbegin
    routine[int-expr] (int width, int id) { routine body 1 }
    routine[int-expr] (int width, int id) { routine body 2 }
    ...
    routine[int-expr] (int width, int id) { routine body n }
parend

The routine body of a routine statement is the sequential code executed as a parallel routine. The optional int-expr argument of routine specifies the number of instantiations of this routine. The formal parameter width receives the actual number of routines at invocation time; id is a unique number for each routine. Each routine is therefore able to determine its own identity relative to its sibling routines.
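Putting these keywords together, a small, hypothetical Calypso fragment that adds two vectors in a single parallel step might look as follows. The array names and sizes are invented for this example; width and id are used to partition the index range so that every element of the result is written by exactly one routine, anticipating the access rules described next.

#define N 1024

shared double a[N], b[N], c[N];   /* accessible from all routines */

int main(void)
{
    /* ... sequential code initializing a and b ... */
    parbegin
        routine[8] (int width, int id) {
            /* width instantiations of this routine run concurrently;
               each handles a disjoint slice of the index range, so
               every c[i] is written by exactly one routine */
            for (int i = id; i < N; i += width)
                c[i] = a[i] + b[i];
        }
    parend
    /* ... sequential code using c ... */
    return 0;
}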

Such a routine can have local variables, and it can access global variables as shared memory if they are annotated with shared. The memory consistency model for such a routine is also very simple: local variables are initially undefined, and for all routines within a parallel step, the shared variables retain the value they had at the beginning of a parallel step. Access to shared data follows the Concurrent Read Exclusive Write (CREW) policy: a data item can be read by any number of routines, but only written by at most one.

Programs with Concurrent Read Concurrent Write (CRCW) behavior are also executed correctly if all write accesses to a variable write a unique value. Write updates to shared data occur atomically at the end of a parallel routine. As a consequence, all routines execute in isolation, and updates to shared memory become visible only after a parallel step has finished. This isolation allows the programmer to ignore the order of execution of routines when writing the program. In particular, a read of any shared variable always returns the value the variable had at the beginning of a parallel step, unless the variable has been modified locally within the routine itself. While this simple consistency model does not allow some optimizations enabled by relaxed consistency models, it is commensurate with the argument brought forward by HILL [105]: given modern processors' speculative execution, the complexity of relaxed models outweighs their possible performance benefits.

How can this semantics be implemented on real, unreliable machines? A Calypso program executes in a master/worker fashion. The master process executes all sequential steps and manages the execution of the parallel steps. The routines of these parallel steps are executed by any number of worker processes, usually residing on remote machines. At the beginning of a parallel step, the master waits for workers requesting work.

Upon such a request, the master assigns a routine to the worker, which then executes it. During execution, the worker will (usually) access data in the shared memory. The shared memory is implemented at the page level: at the beginning of a parallel step, all pages of the shared memory are protected against access. If a worker reads a variable located on such a page, the operating system raises a page fault exception.

The Calypso library catches this exception and requests the corresponding page from the master. A write access to a protected page marks it as dirty with a similar mechanism, and at the end of a routine, all dirty pages are sent back to the master.
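The page-protection mechanism just described can be sketched with standard POSIX primitives. This is a simplified, hypothetical illustration of the technique, not Calypso's actual code: fetch_page_from_master stands in for the request protocol, and read and write faults are not distinguished here (doing so requires platform-specific inspection of the fault context).

#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

extern char  *shm_base;    /* start of the shared-memory region */
extern size_t shm_size;    /* its length, a multiple of the page size */

/* hypothetical RPC that obtains a page's contents from the master */
extern void fetch_page_from_master(void *page, long len);

/* At the start of a parallel step: revoke all access, so the first
   touch of any page traps into the handler below. */
void protect_all(void)
{
    mprotect(shm_base, shm_size, PROT_NONE);
}

/* SIGSEGV handler: fetch the faulting page from the master, install
   it, and re-enable access.  A real implementation would grant only
   PROT_READ here and use a second fault to mark the page dirty on
   the first write. */
static void on_fault(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    long psz = sysconf(_SC_PAGESIZE);
    char *page = (char *)((uintptr_t)info->si_addr
                          & ~(uintptr_t)(psz - 1));
    fetch_page_from_master(page, psz);
    mprotect(page, psz, PROT_READ | PROT_WRITE);
}

void install_fault_handler(void)
{
    struct sigaction sa = {0};
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
}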

The master, however, cannot yet integrate such a dirty page into the shared memory, since a subsequent request for this page by another worker could then return a value different from the one at the beginning of the parallel step. Hence, the page updates from the workers are included in the master's shared memory only once the parallel step has been completed. This memory management is called the Two-phase Idempotent Execution Strategy (TIES).
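On the master's side, the two phases of TIES can be pictured as staging and committing page updates. The following sketch uses invented types and names and omits bookkeeping such as buffer overflow and duplicate pages from re-executed routines (which TIES renders harmless).

#include <stddef.h>
#include <string.h>

#define PAGE_SIZE   4096
#define MAX_UPDATES 1024

struct page_update {           /* a dirty page returned by a worker */
    size_t page_no;
    char   data[PAGE_SIZE];
};

static struct page_update staged[MAX_UPDATES];
static size_t n_staged;

/* Phase 1: buffer updates as routines finish.  The live shared memory
   is NOT touched, so any later request for one of these pages still
   returns the value from the beginning of the step. */
void on_routine_done(const struct page_update *u, size_t count)
{
    for (size_t i = 0; i < count; i++)
        staged[n_staged++] = u[i];
}

/* Phase 2: once ALL routines of the step have terminated, install the
   buffered pages into the master's shared memory in one sweep. */
void commit_step(char *shm_base)
{
    for (size_t i = 0; i < n_staged; i++)
        memcpy(shm_base + staged[i].page_no * PAGE_SIZE,
               staged[i].data, PAGE_SIZE);
    n_staged = 0;
}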

TIES enables another important technique: eager scheduling, which is at the core of Calypso's mechanism for load balancing and for tolerating worker failures. If a worker fails, the routine it has been assigned will never finish, and the master would wait for this routine even while other workers become idle. In Calypso, the master can re-assign already started routines to idle workers (once all routines have been assigned at least once), since TIES guarantees an idempotent, exactly-once semantics of routine execution. This re-assignment implies that worker failures can be masked and that slow machines do not stall a computation, since eager scheduling entails automatic load balancing. Even intermittently available machines can thus be used by Calypso.
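In the master's dispatch loop, eager scheduling reduces to a simple rule: hand an idle worker an unassigned routine if one exists; once every routine has been handed out at least once, re-assign an unfinished one. A hypothetical sketch of such a selection function:

enum rstate { UNASSIGNED, RUNNING, DONE };

struct routine_desc {
    enum rstate state;
    int         assignments;   /* how often the routine was handed out */
};

/* Pick a routine for an idle worker: prefer unassigned work; after
   all routines have been assigned at least once, eagerly re-assign
   the least-often-assigned unfinished routine.  TIES guarantees that
   duplicate executions have an exactly-once effect on shared memory. */
int pick_routine(const struct routine_desc *r, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (r[i].state == UNASSIGNED)
            return i;
        if (r[i].state == RUNNING
            && (best < 0 || r[i].assignments < r[best].assignments))
            best = i;
    }
    return best;               /* -1: all routines are DONE, step over */
}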