
The analysis presented here suffers in its most general form from the high numerical complexity of the solution. It might prove interesting to consider advanced numerical techniques (e.g., Monte-Carlo approaches) to solve these integrals. However, such approaches might turn out to be very similar to simulations.
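Purely as an illustration of what such a numerical approach could look like, the following sketch estimates a multidimensional integral by simple Monte-Carlo sampling; the integrand f and the integration bounds are hypothetical placeholders and do not correspond to the actual distribution functions of the analysis above.

```python
import random

def mc_integrate(f, lower, upper, samples=100_000):
    """Estimate the integral of f over the axis-aligned box given by the
    per-dimension bounds 'lower' and 'upper' by averaging f at uniformly
    drawn sample points and scaling with the volume of the box."""
    volume = 1.0
    for lo, hi in zip(lower, upper):
        volume *= hi - lo
    total = 0.0
    for _ in range(samples):
        x = [random.uniform(lo, hi) for lo, hi in zip(lower, upper)]
        total += f(x)
    return volume * total / samples

# Hypothetical three-dimensional integrand, used only to show the call.
estimate = mc_integrate(lambda x: x[0] * x[1] + x[2], [0, 0, 0], [1, 1, 1])
```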

The solution can also be used to derive characteristic moments of the runtime distribution (such as average or standard deviation) for eager scheduling and to compare them with those of other scheduling schemes. The analysis technique could also prove beneficial for investigations of other scheduling mechanisms.

An integration of these results in a resource management scheme such as the one described in [24], along with a simple description of the nature of the Calypso application (e.g., along the lines of the model discussed in Section 4.3), would allow the resource management system to make more informed decisions about the consequences of adding or taking away resources from a particular program. Ideally, this description could include, along with information about the program itself and its deadline, precomputed values of the program’s responsiveness with a varying number of resources. Similarly, tunable Calypso programs as described by CHANG et al. [49] could use this information to introspectively adapt their control flow to responsiveness requirements. Ideally, these two concepts should be integrated for maximum flexibility.

If I am given a formula, and I am ignorant of its meaning, it cannot teach me anything; but if I already know it, what does the formula teach me?

– St. Augustine

Chapter 6

Checkpointing for Responsiveness

Checkpointing as a fault-tolerance mechanism for responsiveness is considered in this chapter. An analysis of the checkpointing interval problem is presented that optimizes the responsiveness of a service under given assumptions. This analysis is then used to derive checkpointing intervals for a Calypso version that checkpoints the master process.

6.1 Introduction

Any single point of failure, such as the master process of a Calypso program, can be a serious obstacle to achieving high responsiveness. This chapter studies how checkpointing can be used to ameliorate this problem.

Checkpointing is a widely used and well-researched paradigm to improve fault tolerance. At certain times, the state of the system is written from volatile memory to stable memory (a checkpoint). If a fault occurs (and is detected), a rollback recovery step takes place: the most recent checkpoint is restored back into main memory and the system resumes execution from this state.
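The paradigm can be summarized in a few lines of heavily simplified code; this is only a sketch, and the functions step and fault_detected as well as the in-memory "stable storage" are hypothetical stand-ins, not part of any particular checkpointing package.

```python
import copy

def run_with_checkpointing(initial_state, steps, checkpoint_every, step, fault_detected):
    """Sketch of the checkpoint/rollback-recovery cycle: the volatile state is
    periodically copied to (simulated) stable storage; when a fault is detected,
    the most recent checkpoint is restored and execution resumes from there."""
    state = copy.deepcopy(initial_state)
    checkpoint = (copy.deepcopy(state), 0)            # (stable state, step index)
    i = 0
    while i < steps:
        state = step(state, i)                        # one unit of useful work
        if fault_detected(state):                     # e.g., an acceptance test
            state = copy.deepcopy(checkpoint[0])      # rollback recovery
            i = checkpoint[1]                         # redo the lost work
            continue
        i += 1
        if i % checkpoint_every == 0:
            checkpoint = (copy.deepcopy(state), i)    # take a checkpoint
    return state
```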

Application areas for checkpointing are, e.g., transaction-oriented database systems or long-running parallel applications [33]. Often, checkpointing optimization focuses on increasing the availability of a system or decreasing the mean response time of a service. Such optimizations are straightforward as long as no deadlines are considered (and a large body of knowledge related to this is available, see Section 6.2); adding deadlines and using responsiveness as the evaluation metric makes this problem somewhat unusual, since it is no longer clear how often a checkpoint should be taken. The number of checkpoints to be taken during service execution is then the controlled parameter of an optimization problem, where service and system parameters (e.g., execution time, time to restart a process, or deadline) are given as independent variables.
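To make the shape of this optimization problem concrete, the following sketch estimates the responsiveness (the probability of meeting the deadline) by simulation and searches over the number of equidistant checkpoints. It is only an illustration under strong simplifying assumptions (exponentially distributed faults, immediate and perfect fault detection, arbitrarily chosen parameter values) and is not the analysis developed in Section 6.4.

```python
import random

def estimate_responsiveness(n_chkpts, work, chkpt_cost, restart, deadline,
                            fault_rate, runs=20_000):
    """Estimate P(completion before the deadline) when 'work' is split into
    n_chkpts + 1 equidistant segments, each but the last followed by a
    checkpoint of duration chkpt_cost, with exponential faults of the given
    rate and immediate, perfect fault detection."""
    segment = work / (n_chkpts + 1)
    hits = 0
    for _ in range(runs):
        t, done = 0.0, 0
        while done <= n_chkpts and t <= deadline:
            cost = segment + (chkpt_cost if done < n_chkpts else 0.0)
            fault = random.expovariate(fault_rate)
            if fault < cost:            # fault hits this segment: work is lost
                t += fault + restart
            else:                       # segment (and its checkpoint) completes
                t += cost
                done += 1
        if done > n_chkpts and t <= deadline:
            hits += 1
    return hits / runs

# Search over the controlled parameter: the number of checkpoints.
best_n = max(range(20), key=lambda n: estimate_responsiveness(
    n, work=100.0, chkpt_cost=2.0, restart=5.0, deadline=150.0, fault_rate=1 / 200))
```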

As a basis for this optimization problem, a general model for a service under checkpointing is described in Section 6.3, along with an appropriate fault detection scheme. This fault detection is neither assumed to be immediate (which would not be realistic, in particular if a mechanism like a remote watchdog is used) nor perfect. In Section 6.4, an analysis of the problem of finding an optimal checkpointing interval for a service with a deadline is presented; some evaluations of this theoretical model are shown in Section 6.5. The theoretical analysis and evaluations for services with a fixed execution time have been performed in joint work with M. Werner; details can be found in [138, 139].

Adding checkpointing to Calypso, on the basis of this analysis, presents its own set of challenges. Checkpointing in parallel systems is usually complicated by the need to ensure consistency when distributed processes checkpoint their state. In Calypso, this is not the case: only the master process has to write a checkpoint, since worker failures are handled by eager scheduling. Therefore, local checkpointing can be used, which makes checkpointing an even more attractive mechanism. Implementation issues of checkpointing a Calypso master are discussed in Section 6.6, where some experimental results are shown as well. The chapter is concluded with Section 6.7, and possibilities for future work are outlined in Section 6.8.

6.2 Related work

Checkpointing in general is a widely researched area. A general overview of backward recovery methods can be found in [8]. In particular, the problem of choosing checkpointing intervals is of great practical importance and has received corresponding attention.

In a classic short paper, YOUNG [310] gives a first-order approximation to the optimum checkpointing interval. The main limitation of this approach is that errors are not allowed to occur during error recovery. CHANDY et al. [48] investigate transaction-oriented systems with fixed or cyclically varying transaction request rates. Neither of these papers is concerned with real-time properties.
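For reference, YOUNG’s result can be stated compactly (using symbols introduced here only for illustration, which need not match the notation of [310]): with an overhead of C per checkpoint and a mean time between failures of M, the optimum interval between checkpoints is, to first order,

\[ T_{\text{opt}} \approx \sqrt{2\,C\,M}. \]

Larger checkpoint overheads or rarer failures thus lead to longer intervals between checkpoints.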

SHIN et al. [255] consider the problem of using checkpointing for real-time tasks if only imperfect fault detection mechanisms are available. Their optimization goal is the mean task execution time with the additional constraint that the probability of an unreliable result must be kept smaller than a prespecified value.

They describe an algorithm for finding optimal placements of checkpoints to solve this problem. A particularly interesting result of this work is the fact that for imperfect fault detection mechanisms, equidistant checkpoints are only a suboptimal choice for their optimization goal. But since most available checkpointing packages (like the one presented in [298]) are based on equidistant checkpointing (unless checkpointing is directed by the programmer, usually with different considerations in mind), equidistant placement of checkpoints is a more realistic assumption. Furthermore, while SHIN et al. investigate real-time tasks, they ignore deadlines. As is shown later (see Section 6.5), the deadline does have a significant impact on the choice of an optimal checkpointing interval.

For performance reasons, many real checkpointing packages (like, e.g., [298]) typically do not wait for the completion of checkpoints. VAIDYA [286] uses Markov models to investigate the tradeoffs between checkpoint latency and overhead. For equidistant checkpoints, the optimal checkpointing interval is shown to be typically independent of the checkpoint latency. This result makes it possible to ignore the impact of checkpointing latency in the following analysis. Moreover, in a real-time context, implementations with considerable checkpointing latency are usually undesirable since they can lead to substantial unpredictability.

KRISHNA et al. [154] acknowledge the need for evaluation criteria for real-time checkpointing other than mean execution time; they introduce a cost measure for checkpointing in a distributed system. They also provide a first approximation to an optimization that takes costs for both the user of the checkpointed service and other users of the system into account. Although this cost metric is more flexible than the responsiveness metric used here, it requires an application-dependent cost function, which can be undesirable.

The work most closely related to the analysis presented here is by GRASSI et al. [94]. They investigate both a system-oriented and a service-oriented view of checkpointing and give a Laplace-Stieltjes transform of the probability distribution of the overhead caused by rollback recovery. The major difference to the approach presented here is that they consider immediate fault detection (the model described below only assumes detection by acceptance tests at discrete times) and do not consider the problem of imperfect fault detection. Additionally, since their results are given in the Laplace domain, they are somewhat cumbersome to use; here, results are obtained in the time domain and therefore do not require any inverse transformations. Similar arguments apply to the work by GEIST et al. [86]: while this paper arrives at very elegant solutions, it does so by assuming immediate and perfect fault detection. Also, neither of these papers addresses the question of a stochastically described execution time.

For checkpointing in distributed systems, ELNOZAHY et al. [75] give an overview of the many possible methods. For the Calypso case, however, distributed checkpointing is not relevant since here only the single master process has to be checkpointed.