
Dealing with the inherent imperfections of any real hardware or software is one of the original problems of fault tolerance. A real component is always susceptible to failure, which endangers the correct functioning of the system as a whole. System failures are often unacceptable, given the immense importance of today’s computer applications, where large amounts of money or even lives can depend on their correct functioning (examples can be found in [85, 101]). Therefore, mechanisms to handle such failures are necessary. And as GRAY and SIEWIOREK [95] point out, the larger a system is, the more important its high availability becomes—but also the less likely it is to actually be highly available, owing to its sheer size and complexity.

Devising such fault-tolerance mechanisms requires research in a number of different areas, which are briefly discussed in the remainder of this section, along with a number of examples—overviews and introductions can be found in, e.g., [8, 62, 95, 231, 259, 260, 280].

The basis for fault tolerance is formed by precise models of a system and the possible faults it can experience. A fundamental issue is the distinction between a fault, which is the ultimate cause of any misfunction but only creates the latent potential for it; an error, which is an undesired circumstance internal to the system and caused by a fault; and finally a failure, which is a user-observed deviation of the system from its specification, caused by (one or more) errors [159, 160]. Sometimes, only faults and failures are considered [148]. Faults, errors, or failures can then be further classified according to their nature, e.g., a machine can crash or a program can compute wrong results. Another classification of faults is according to the time a fault occurred: e.g., when designing a system or during its deployment. A number of such fault models have been proposed (cp., e.g., [62, 161, 163, 229]). Also important is the level where a fault occurs: hardware, operating system, application programs, and so on. Today, hardware is a relatively minor cause of system failures when compared to software or environment-related causes [95].

Models of a system’s behavior can also be used to evaluate parameters of a system: reliability evaluation is an example [40], where reliability at time t is the probability that a system works correctly in the entire interval of time [0, t], provided it worked correctly at time 0. Typical assumptions for such model evaluations are the reliabilities of individual components and stochastic properties like independence of faults. An analysis somewhat similar to reliability analysis is undertaken in Section 5 for some aspects of the Calypso system.
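To make the definition concrete, a standard textbook special case can be written down (a sketch under the common assumptions of a constant failure rate λ and independent component faults; it is not taken from [40]):

    R(t) = \Pr[\text{correct throughout } [0, t] \mid \text{correct at } 0] = e^{-\lambda t},
    \qquad \mathrm{MTTF} = \int_0^\infty R(t)\,dt = \frac{1}{\lambda},
    \qquad R_{\text{series}}(t) = \bigl(e^{-\lambda t}\bigr)^{n} = e^{-n\lambda t}.

The last expression is the reliability of a series system of n such independent components, all of which must work; it decays faster as n grows, which mirrors the observation above that larger systems are less likely to be highly available.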

A basic technique to compensate for potential errors is additional redundancy, either in space or time. Redundancy in space can mean, e.g., duplicating processing elements; redundancy in time can be achieved by, e.g., repeatedly executing an algorithm on one machine and comparing the results of the executions. Usually, neither kind of redundancy appears in isolation, and redundancy can be applied at different levels of a system. Finally, different forms of redundancy, like hot or cold standby, can be distinguished.
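To illustrate redundancy in time, the following minimal sketch (assuming transient faults, so that a re-execution is likely to yield a correct result; compute is a placeholder for the actual computation) executes a computation twice and uses a third execution as a tie-breaker on disagreement:

    # Redundancy in time: run the same computation repeatedly on one machine and
    # compare the results. Assumes faults are transient, so a re-execution is
    # likely to produce a correct result; compute() is an illustrative placeholder.
    def execute_with_time_redundancy(compute, *args):
        first = compute(*args)
        second = compute(*args)
        if first == second:
            return first                 # duplicated execution agrees: accept the result
        third = compute(*args)           # disagreement: use a third run to break the tie
        if third == first or third == second:
            return third
        raise RuntimeError("no two executions agree; the error cannot be masked")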

Another characteristic of fault-tolerance mechanisms is whether they are active before or after an error occurs: forward error recovery tries to deal with errors even before they happen, while backward error recovery only takes minimal steps before error occurrence and repairs the damage afterwards. Again, both mechanisms are usually mixed to some degree.

On the basis of such models and techniques, algorithms can be developed that tolerate faults. A typical class of such algorithms are consensus-based algorithms [28], where a number of active entities, e.g., processors, have to reach a common accord on some data items, even in the presence of faulty or malicious processors. Consensus-based frameworks for fault tolerance have been proposed [190, 191], and consensus is at the center of the CORE system, described more closely in Section 3.5.4.

Linked with the question of algorithms is that of data representation. Again, the principle of redundancy can be efficiently used by representing data with a coding that has enough information to protect against faults. Perhaps the simplest example is parity: for each data word, a single additional bit is stored that encodes whether the number of set bits in this word is odd or even. Parity is generalized by the Cyclic Redundancy Check (CRC), which adds a few check bits to a message and allows detection of corrupted messages with low overhead (easily implemented in hardware) and very high probability.
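A minimal sketch of both codings, using Python’s standard binascii module for the CRC (the message content and frame layout are illustrative, not those of any particular system):

    # Redundant data representation: a single even-parity bit per data word, and a
    # CRC-32 checksum appended to a message to detect corruption.
    import binascii

    def parity_bit(word: int) -> int:
        """Even parity: 1 if the number of set bits in 'word' is odd, else 0."""
        return bin(word).count("1") % 2

    def attach_crc(message: bytes) -> bytes:
        """Append a 4-byte CRC-32 to the message."""
        return message + binascii.crc32(message).to_bytes(4, "big")

    def check_crc(frame: bytes) -> bool:
        """Recompute the CRC over the payload and compare with the stored value."""
        payload, stored = frame[:-4], frame[-4:]
        return binascii.crc32(payload).to_bytes(4, "big") == stored

    frame = attach_crc(b"some message")
    assert check_crc(frame)                          # intact frame passes
    corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]
    assert not check_crc(corrupted)                  # single-bit corruption is detected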

Detecting errors is not always as easy as with CRC; it depends strongly on the assumed fault model. Supervising processors with additional watchdog hardware or software is one example; acceptance, timing, or plausibility checks are others [8]. In a distributed system, detection can be done by multiple machines diagnosing each other. Here the problem arises of how to determine which processors are actually faulty when contradictory diagnosis results are found—this was first addressed by PREPARATA et al. [232] (see [194] for an overview of system diagnosis). Somewhat similar problems arise in testing a system before deployment.

After an error has been detected (and, if necessary, properly diagnosed), some actions must be taken to compensate for it. One such measure is rollback recovery, where the system returns to a previous state that is considered to be correct. Checkpointing is a way to implement such a rollback step: the system periodically writes state information to stable storage and, upon detecting an error, reads in such a checkpoint and resumes processing from this point onwards, effectively retrying the previous execution sequence. One problem here is how to choose the checkpointing interval so as to optimize a desired metric. In Chapter 6, an analysis of the checkpointing interval problem for optimizing responsiveness is presented, along with additional related work.
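A minimal sketch of checkpoint-based rollback recovery, assuming a long-running computation whose state fits into a single picklable object; the file name, the interval, and the placeholder computation are illustrative:

    # Periodically write the state to stable storage; after a crash and restart,
    # the program rolls back to the last checkpoint and retries from there.
    import os, pickle, tempfile

    CHECKPOINT = "state.ckpt"
    INTERVAL = 100  # write a checkpoint every 100 iterations (a tunable trade-off)

    def load_checkpoint():
        """Resume from the last checkpoint if one exists, else start from scratch."""
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT, "rb") as f:
                return pickle.load(f)
        return {"step": 0, "accumulator": 0}

    def write_checkpoint(state):
        """Write atomically: dump to a temporary file, then rename over the old one."""
        fd, tmp = tempfile.mkstemp(dir=".")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, CHECKPOINT)

    state = load_checkpoint()      # after a crash, this returns the last saved state
    while state["step"] < 10_000:
        state["accumulator"] += state["step"]   # placeholder for the real computation
        state["step"] += 1
        if state["step"] % INTERVAL == 0:
            write_checkpoint(state)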

All these techniques and mechanisms together have one main objective: to ensure failure-free, continuous service of a system. But no single mechanism can achieve this in isolation; they have to be integrated in a complete system design. The following sections discuss a few exemplary systems, as well as some paradigms mentioned above in more detail.

3.3.1 Custom-built systems

Software Implemented Fault Tolerance—SIFT

Software Implemented Fault Tolerance (SIFT) [299] is an operating system designed for use in flight-critical functions in commercial aircraft. Such applications require very high reliability. SIFT’s approach is to use standard, simple hardware and implement most of the fault tolerance in software. The basic design is a number of star-connected processors, where the interconnection network is used to broadcast messages to all peer processors. SIFT provides services such as scheduling, synchronization, consistency, communications, fault masking, and reconfiguration.

The main abstraction in SIFT is the task, the unit of computation. Tasks are scheduled at precomputed times, uniformly on all processors. Data produced by tasks is broadcast to all other processors, and then a special voter task compares the results of an application task using majority voting. Processors with deviating results are marked as faulty, and a reconfiguration is initiated to exclude such a processor from further processing.
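A minimal sketch of the voting step in the spirit of SIFT’s voter tasks (the scheduling, broadcasting, and reconfiguration machinery is omitted; processor names and values are illustrative):

    # Each processor reports its result for a task; the majority value is accepted
    # and processors that deviate from it are flagged as faulty.
    from collections import Counter

    def vote(results: dict) -> tuple:
        """results maps processor id -> reported value. Returns (majority, suspects)."""
        counts = Counter(results.values())
        majority_value, votes = counts.most_common(1)[0]
        if votes <= len(results) // 2:
            raise RuntimeError("no majority; the fault cannot be masked")
        suspects = [p for p, v in results.items() if v != majority_value]
        return majority_value, suspects

    value, faulty = vote({"P1": 42, "P2": 42, "P3": 41, "P4": 42})
    # value == 42, faulty == ["P3"]  -> P3 would be excluded by reconfiguration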

This software-based voting permits a much looser clock synchronization than needed for hardware-implemented voting schemes (by about three orders of magnitude). SIFT proposed a novel clock synchronization mechanism that achieves the desired precision while being able to mask faulty clocks. One goal of SIFT was to simplify the correctness proofs of the design of a fault-tolerant system. However, the overhead introduced by software fault tolerance turned out to be very high: SIFT was found to use up to 60% of processing time for internal functions [280, p. 397].

Error Resistant Interactively Consistent Architecture—ERICA

Unlike SIFT, the Error Resistant Interactively Consistent Architecture (ERICA) [288] concentrates on hardware solutions to build a computer that behaves correctly despite hardware failures. In ERICA, a processor/memory module is replaced by n processors and k times the memory. The memory is spread over the n processors, so that one module now has k/n times the original amount of memory. Access to memory happens via a special encoder/decoder logic: using an appropriate (n, k) code, memory content is stored redundantly.

Given large enough values of n and k, failures of entire modules can be tolerated—this is the case for n = 4 and k = 2. Since the hardware hides the redundant memory and replicated processors from the software, only minimal modifications to the operating system or applications are necessary. Indeed, an operating system developed for a non-redundant machine was used unaltered on a redundant version. It is a particular strength of ERICA that any architecture can be systematically transformed into a redundant counterpart, provided the hardware behaves deterministically.
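As a deliberately simplified illustration of spreading memory redundantly over modules, the following sketch uses a single XOR parity module so that the content of any one failed module can be reconstructed from the survivors. This only conveys the general idea; it is not ERICA’s (4, 2) code, which carries more redundancy and provides stronger guarantees:

    def encode(d0: int, d1: int, d2: int) -> list:
        """Store three data words plus their XOR parity across four modules."""
        return [d0, d1, d2, d0 ^ d1 ^ d2]

    def reconstruct(modules: list, failed: int) -> int:
        """Recover the word held by the failed module by XOR-ing the surviving ones."""
        result = 0
        for i, word in enumerate(modules):
            if i != failed:
                result ^= word
        return result

    stored = encode(0x12, 0x34, 0x56)
    assert reconstruct(stored, failed=1) == 0x34   # module 1 lost, content recovered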

The values of n and k determine system reliability and cost. Reliability improvement is here measured as the ratio between the mean time to repair of the redundant and the non-redundant system; the cost ratio is defined analogously. Typical approaches like double redundancy (n = k = 2) or triple modular redundancy (n = 3, k = 1) are shown to deliver only modest reliability improvements [288]. In ERICA, the (4, 2) concept is used, which provides high reliability improvements at costs comparable to triple modular redundancy. Other designs exceed the reliability of a (4, 2) system, but pay a higher premium in cost. Additionally, four processors is the smallest possible number for which the Byzantine generals problem can be solved; in ERICA, a special chip implements a protocol for solving it.

The particular strength of ERICA is the transparent implementation of fault tolerance, combined with a systematic approach to construct a redundant system architecture. Theoretical results concerning (n, k) coding strengthen the case made for ERICA.

3.3.2 Group communication

One popular approach to fault tolerance is to use replicated entities or objects. Replication often results in a consistency problem, and the process group approach has been suggested [34] to address it with a simple programming interface by providing a single-object view on a group of objects.

The abstract idea of the process group approach is implemented by group communication protocols. To do so, a group communication protocol has to solve a number of problems, namely providing consistent semantics of message delivery and the management of group membership. The consistency of a group communication protocol is characterized by the message ordering guarantees it provides. A clear definition of such message orders is necessitated by the nature of many-to-many communication. For example, two messages sent by two senders can be observed in different orders by two receivers. Depending on the application semantics, this may or may not be acceptable. A group communication protocol ensures uniform behavior in such cases with a number of possible guarantee levels. Reliable message delivery is usually required from a group communication protocol; beyond that, a protocol could guarantee that messages from a single sender are always received in the order they were sent (FIFO ordering); it could guarantee that all potential causal dependencies between messages are respected by message receivers (causal ordering); or it could guarantee that all processors deliver all messages in exactly the same order (total order property).
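A minimal sketch of the weakest of these guarantees, FIFO ordering, assuming reliable but possibly reordering channels (sender names and the in-memory buffering are illustrative; causal and total ordering need additional machinery such as vector clocks or a sequencer):

    # Each sender numbers its messages; a receiver delivers messages from a given
    # sender strictly in sequence, buffering anything that arrives early.
    class FifoReceiver:
        def __init__(self):
            self.next_seq = {}      # sender -> next expected sequence number
            self.pending = {}       # sender -> {seq: message} buffered out of order

        def receive(self, sender, seq, message):
            """Returns the list of messages that become deliverable, in FIFO order."""
            self.pending.setdefault(sender, {})[seq] = message
            expected = self.next_seq.setdefault(sender, 0)
            deliverable = []
            while expected in self.pending[sender]:
                deliverable.append(self.pending[sender].pop(expected))
                expected += 1
            self.next_seq[sender] = expected
            return deliverable

    r = FifoReceiver()
    assert r.receive("A", 1, "second") == []                  # arrived early, buffered
    assert r.receive("A", 0, "first") == ["first", "second"]  # now both are deliverable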

In the presence of node failures, the notion of reliable delivery is also somewhat more complicated than it is for unicast communication. Additionally, group communication protocols used for fault tolerance should be fault-tolerant themselves. Many protocols have been designed to cope with different fault scenarios; an overview of issues in group communication can be found, e.g., in [60, 99].

One well-known example of a group communication system is ISIS [34], later redesigned as Horus [289]. Other systems include the global sequencer [124], a 3-phase commit protocol [186], Transis [67], and the Totem protocol [204], which are discussed in more detail in Section 7.4.

With regard to responsiveness, the behavior of group communication protocols in real time is of interest. While many protocols have well-defined logical semantics, even in the presence of faults, little attention has been paid to their real-time aspects. An experimental investigation of this question for the Totem protocol is presented in Section 7.4.

3.3.3 Cluster-based availability

A number of projects aim at using clusters to provide increased availability for services (PFISTER [223] gives an overview). The prospect is a tempting one indeed: where traditional commercial fault-tolerance architectures (such as Tandem’s Integrity systems [119]) have struggled to incorporate redundancy in a single machine, today machines are cheap enough to use an entire system as a unit of redundancy—difficult hardware questions like hot-pluggable CPUs are a non-issue when an entire machine can be plugged in and out, with software taking care of consistency. In a similar vein are systems that attempt to present a cluster as a single machine; examples include Solaris MC [144, 256] or the single system image proposal for UnixWare from Compaq [297]. However, these systems deal with failures only superficially (e.g., while Solaris MC survives the crash of any machine, processes running on this machine are simply lost).

Hence, there remain open questions. Consistency is one of them, efficiency another; transparency to clients is a further obvious necessity. A few cluster-based projects aiming at increased service availability are discussed in the following sections.

SunSCALR

SINGHAI et al. [262] propose SunSCALR, a design for a highly available, scalable, and inexpensive server for Internet-related services such as the WWW. The design consists of a cluster of workstations running standard UNIX operating systems. A specific service, such as a WWW server, is associated with a group of machines from this cluster. If any machine in this cluster fails (fail-stop behavior is assumed), its IP address is assigned to another machine (selected with a leader election protocol), and peer hosts and routers are informed of the new location of this IP address. The service is then restarted on this new host (similarly if only a service, but not the entire machine, fails). This is called IP failover and is the core mechanism for scalability and availability.

Failures are detected with heartbeat messages: every host cyclically broadcasts an alive message; if more than a certain number of these messages are lost, a host is deemed to have failed. Additionally, this heartbeat message can also be used to communicate load information and balance the load of individual servers by temporarily reassigning IP addresses. IP failover also allows a simple integration of additional or repaired machines, implying on-line scalability.
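A minimal sketch of such heartbeat-based failure detection (the period, the threshold, and the host names are illustrative parameters, not SunSCALR’s actual values):

    # Each host's broadcast "alive" messages are timestamped on reception; a host
    # whose heartbeats have been missing for too long is suspected to have failed.
    import time

    HEARTBEAT_PERIOD = 1.0      # seconds between "alive" broadcasts
    MISS_THRESHOLD = 3          # heartbeats that may be lost before declaring failure

    class FailureDetector:
        def __init__(self):
            self.last_seen = {}   # host -> time of the most recent heartbeat

        def heartbeat(self, host):
            self.last_seen[host] = time.monotonic()

        def failed_hosts(self):
            """Hosts silent for longer than the allowed number of periods."""
            deadline = MISS_THRESHOLD * HEARTBEAT_PERIOD
            now = time.monotonic()
            return [h for h, t in self.last_seen.items() if now - t > deadline]

    fd = FailureDetector()
    fd.heartbeat("host-a")
    # In SunSCALR, a host reported by failed_hosts() would trigger IP failover:
    # its address is reassigned to a surviving machine and the service restarted there.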

This scheme allows a very simple and efficient implementation (failover latency in SunSCALR is around 10 s [262]) of a highly available distributed server for many applications. Since it is IP-based (and not based on distributed name servers like some other proposals with similar intent), even clients that cache server address mappings do not observe the failure of a server machine in between service requests. However, this is only true if the application is stateless (like, e.g., the WWW) or is capable of handling restarted servers. This is usually the case if an application can reissue service requests multiple times without changing the semantics, i.e., if it is idempotent. Hence, SunSCALR is not fully transparent, but it closely matches important applications from the Internet context.

Wolfpack

Microsoft Corp. recently introduced a clustering extension to its popular Windows NT operating system called “Windows NT Clustering Service” [85, 257, 292], also known as “Wolfpack”. The main concern of Wolfpack is to improve the availability of servers during hardware and/or software failures. Other goals are increased scalability—which is somewhat debatable considering that in the original Wolfpack version only two nodes can be used—and better management functionality. Applications that use such a server are presented with the illusion of a single, powerful, and highly available machine.

Wolfpack uses four abstractions to structure its approach: nodes, resources, resource dependencies, and resource groups. A resource is the basic unit of management, such as a disk or an IP address. Resources can be bundled into logical groups that are managed as a single entity and also form the unit of migration between nodes.

Wolfpack clusters are based on the “shared nothing” principle: Any resource available in the cluster is owned by exactly one node. In case of failure of this node, the clustering software detects this failure (by means of a simple heartbeat mechanism) and then moves all resources owned by this node to another system in the cluster (“failover”). A software resource has to be restarted by the cluster service. Resources can also be explicitly pulled from or pushed to some nodes. However, such a migration of working resources results in temporary service outage.
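A minimal sketch of this shared-nothing failover idea (class names, the node-selection policy, and the resource lists are illustrative; Wolfpack’s actual resource and dependency model is richer):

    # Every resource group is owned by exactly one node; when its owner is declared
    # failed, the whole group is moved to a surviving node and restarted there.
    class ResourceGroup:
        def __init__(self, name, resources):
            self.name = name
            self.resources = resources   # e.g., ["shared disk", "IP 10.0.0.5", "web server"]
            self.owner = None

    class Cluster:
        def __init__(self, nodes):
            self.nodes = set(nodes)
            self.groups = []

        def bring_online(self, group, node):
            group.owner = node
            self.groups.append(group)

        def node_failed(self, node):
            """Failover: reassign every group owned by the failed node."""
            self.nodes.discard(node)
            if not self.nodes:
                raise RuntimeError("no surviving node to fail over to")
            for group in self.groups:
                if group.owner == node:
                    group.owner = next(iter(self.nodes))   # pick any surviving node
                    # software resources must be restarted on the new owner, which
                    # causes the temporary service outage noted above

    cluster = Cluster(["node1", "node2"])
    web = ResourceGroup("web", ["shared disk", "IP 10.0.0.5", "web server"])
    cluster.bring_online(web, "node1")
    cluster.node_failed("node1")
    assert web.owner == "node2"          # the group was failed over to the other node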

The existence of such service outages (either due to voluntary migration or failover) implies that access to cluster-based resources is not completely transparent for clients that maintain state: they must reconnect to the server after failover has been completed. For such applications, Wolfpack provides a semantics similar to that of monolithic systems that employ restart mechanisms. This situation is even less convenient with intermediary software layers (e.g., database engines) that transparently perform the reconnection, but lose application state.

Comparing Wolfpack with SunSCALR shows that the simpler mechanisms of the latter do not necessarily impede its functionality. SunSCALR’s IP-based mechanisms give it the same level of transparency as, and better scalability than, Wolfpack’s somewhat complicated architecture. Also, Wolfpack is tightly coupled to one particular operating system. Both systems employ a cold standby approach and have to restart software services after failures are detected. It remains to be seen how the development of Wolfpack will proceed.

NonStop Cluster Application Protection System—NCAPS

Tandem’s NonStop Cluster Application Protection System (NCAPS) [162] shall serve as a last example of cluster-based high-availability solutions. NCAPS leverages the high performance of a ServerNet-based fault-tolerant architecture [17] to build a simple programming environment for improving the availability of applications running on UNIX clusters. Unlike Wolfpack, it uses a warm standby approach: an application process is accompanied by a backup process (running on a different node) that is in an idle state and takes over when the primary process fails. That backup process does not provide service; it is only initialized. The backup process allows very fast failover (on the order of 10 s), as opposed to Wolfpack’s restart approach (application start and initialization can take up to 20 minutes in an example in [162]).

To implement this, a number of small services are used: a heartbeat-based node monitor, a keep-alive service that restarts failed applications, and, at NCAPS’s core, a Process Pair Manager (PPM). The PPM is responsible for detecting application failure, promoting a backup process to primary status, and starting another backup. An application has to be linked with a special library to work with the PPM.
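A minimal sketch of the process-pair idea behind the PPM (the classes and the promote call are illustrative stand-ins, not NCAPS’s actual library interface):

    # Warm standby: a primary and an initialized but idle backup; on primary
    # failure, the backup is promoted and a fresh backup is started.
    class ServerProcess:
        """Placeholder for a real service process; promote() switches it to active duty."""
        def __init__(self, role):
            self.role = role
        def promote(self):
            self.role = "primary"

    def start_process(role):
        return ServerProcess(role)      # in NCAPS this would launch a real process on a node

    class ProcessPair:
        def __init__(self, start):
            self.start = start
            self.primary = start("primary")
            self.backup = start("backup")   # initialized but idle (warm standby)

        def on_primary_failure(self):
            """Promote the already-initialized backup (fast) and start a new backup."""
            self.primary = self.backup
            self.primary.promote()
            self.backup = self.start("backup")

    pair = ProcessPair(start_process)
    pair.on_primary_failure()           # failover: the backup becomes the new primary
    assert pair.primary.role == "primary"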